Simple and Distributed Machine Learning
OpenAIChatCompletion transformer (#1887)
This list of changes was auto generated.
ChatGPT and GPT-4 at Scale | Simple Deep Learning | LightGBM v2 |
Intelligent chat and embeddings. Simplified Prompting APIs. | Train custom image and text classifiers with ease | Higher performance, >10x lower memory footprint, same API |
View Notebook | Learn More | Try an example |
ONNX Model Hub | Causal Learning | Vowpal Wabbit v2 |
Embed >150 state of the art deep networks into your pipelines | Discover and measure causal treatment effects | New second generation integration |
Learn More | View Docs | Explore Samples |
DeepTextClassifier, a simple API for fine-tuning a wide array of Hugging Face 🤗 text transformers using PyTorch Lightning (#1591)
DeepVisionClassifier, a simple API for deep transfer learning and fine-tuning of a variety of vision backbones (#1518)
SpeakerEmotionInference transformer to generate emotion annotation tags for emotive reading in SpeechToText (#1691)
SpeechToTextSDK and ConversationTranscription (#1801)
descriptionExcludes parameter to AnalyzeImage (#1590)
DoubleMLEstimator for learning causal treatment effects from data (#1715)
passThroughArgs feature which allows users to set low level LGBM parameters before they are wrapped in SparkML (#1749)
toNDArray (#1592)
modelVersion param in TextAnalytics (#1756)
DotnetTestBase assembly version (#1601)
executionMode parameter (#1779)
synapse-internal to platform detector function (#1651)
onnx namespace (#1711)
We are excited to highlight the contributions of the following SynapseML contributors:
Scott Votaw | Serena Ruan | Haizhou (Dylan) Wang |
Scott Votaw is a Principal Engineer on the SynapseML team who has solved some of SynapseML’s toughest challenges in record time. In this release, Scott contributed both the new LightGBM streaming execution mode, and fully replaced our deep learning stack with the ONNX Runtime. These efforts were massive lifts including huge changes to the LightGBM native libraries and complex dependency management jujitsu respectively. Scott brings his love for the craft to every project he works on so keep your eyes peeled for more amazing feats of engineering from him in future releases. | Serena is a Software Engineer II on the SynapseML team and operates on a separate plane of existence from the rest of us mere mortals. Following up on prior major contributions like .NET support, form recognition, translation, and creating the SynapseML Website, Serena contributed the Simple Deep Learning package for this release. This package makes it easy to train modern deep text and vision networks from Hugging Face and torchvision on Spark clusters. Serena seeks only the most difficult engineering challenges and her contributions have laid the groundwork for many more deep-learning-based algorithms in SynapseML. | Haizhou (Dylan) is a Senior Software Engineer in the CSX Data team and a first-time contributor to the SynapseML library. Dylan contributed the new SynapseML causal learning package for the v0.11 release. This package helps users discover the effectiveness of things like medical treatments or economic policies even without controlled experiments. With his elegant contributions, Dylan has laid the foundation for more causal collaborations with the EconML library. |
Markus Cozowicz | Brendan Walsh | Jessica Wang |
Markus is a Principal Applied Scientist who (just!) joined the SynapseML team. Despite only recently coming on board officially, Markus has long been a prolific contributor to the library and built the Vowpal Wabbit and Isolation Forest integrations. In this release, Markus contributed the second generation of the Vowpal Wabbit integration, improving its generality and applicability. He also expanded the OpenAI integration to support embeddings and simplified prompt templating. Our team is incredibly lucky to have such a consistent and thoughtful collaborator. | Brendan is a Senior Engineer on the SynapseML team who recently joined after a long tenure on the Cognitive Services team where he developed their containerized cognitive service effort and co-authored the SynapseML publication on large-scale microservices. Brendan used this expertise to onboard Emotion Detection for text-to-speech models. He then went on to use this new emotive reading capability to create and donate thousands of audiobooks to the open-source community. You can learn more about Brendan’s awesome technical philanthropy efforts at https://aka.ms/audiobook. | Jessica is a Software Engineer who recently joined the SynapseML team. Already, Jessica has grown into the role of the SynapseML benevolent “doc”tator. In this release, Jessica has worked hard to ensure that the SynapseML notebooks work across a wide variety of Spark platforms and are easy and simple to get started with. This work requires knowledge of the entire library’s surface area, and we are thankful Jessica has worked so hard to learn this breadth of content. If you have been following notebook examples from https://aka.ms/spark you have Jessica to thank! |
Kyle Rush | Avrilia Floratou | Jason Wang |
Kyle is a Senior Software Engineer on the SynapseML team with a penchant for architecture and a streak of taking on big responsibility behind the scenes. Kyle has been instrumental in expanding our testing infrastructure to new platforms so that the lights stay on even as the number of contributions increases. This often requires nontrivial code and delicate cross-team collaboration, and Kyle has both the engineering might and the charismatic finesse to make sure these systems can be spun up successfully. | Avrilia is a Principal Scientist Manager in the Gray Systems Lab, a first-time SynapseML contributor, and a delightful collaborator. This release, Avrilia contributed the first prototype of the simplified OpenAI prompting transformer. This contribution makes it easy to ask ChatGPT and other LLMs questions about large datasets and to create new LLM-derived columns in databases. You can learn more about her work through the OpenAI Docs and prompting demo. | Jason Wang is a Principal Software Engineer on the CSX Data team and has a long history of not only contributing huge features to SynapseML, but actively maintaining his contributions. This release, Jason’s work on the ONNX model hub protocol enables quick access to more than 150 pretrained deep networks from the Java and Scala ecosystems. Jason has also been instrumental in fixing the most difficult and arduous bugs, some even stemming from the core Spark runtime. Finally, we deeply appreciate Jason’s leadership in the community: he consistently encourages and helps others contribute, and his impact extends far beyond his own personal contributions. |
We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML:
Eric Dettinger, Markus Weimer, Serena Ruan @serena-ruan, Scott Votaw @svotaw, Haizhou (Dylan) Wang @dylanw-oss, Puneet Pruthi @ppruthi, Markus Cozowicz @eisber, Brendan Walsh @BrendanWalsh, Jessica Wang @JessicaXYWang, Kyle Rush @k-rush, Avrilia Floratou, Jason Wang @memoryz, Mark Niehaus @niehaus59, Keerthi Yanda @KeerthiYandaOS, Ilya Matiach @imatiach-msft, Kashyap Patel @ms-kashyap, Martha Laguna @martthalch @marthalc, Sarah Shy @sarahshy, @ocworld, @adityakode, @nightscape, Alexandra Savelieva @alsavelv, Tom Finley, Jeff Zheng, James Verbus @jverbus, Chris Hoder, Misha Desai, Nellie Gustafsson, Eren Orbey, Beverly Kodhek, Louise Han @jr-MS, Raj Rikhy, Marcos Campos, Mike Estee, Brice Chung, Justyna Lucznik, Kim Manis, Mitrabhanu Mohanty, Bogdan Crivet, Anand Raman, William T. Freeman, Akshaya Annavajhala (AK), Guolin Ke, Spark.NET Team, ONNX Team, Azure Global, Vowpal Wabbit Team, LightGBM Team, MSFT Garage Team, MSR Outreach Team, Speech SDK Team, MLflow Team
OpenAI Language Models | .NET, C#, and F# Support | Full MLFlow Support | Live Demos in Browser |
Embed 175-billion parameter models into your databases with ease | Use or train any SynapseML model from .NET | Quick and easy MLOps, model management, and autologging | Explore the SynapseML library with zero setup |
Learn More | Getting Started Guide | Explore the Docs | Run in Browser |
setServiceName python method in OpenAI (#1498)
useSingleDataset mode (#1527)
singleDatasetMode (#1458)
LightGBMRanker (#1368)
useSingleDatasetMode (#1562)
saveNativeModel for the VWRegressionModel #1364 (#1366)
itsdangerous as a dependency to ADB tests (#1412)
DataConversion serialization (#1505)
TestBase (#1501)
GridSpace python API (#1470)
ComputeModelStatistics output and convert scoredLabelsCol to DoubleType (#1361)
We are excited to highlight the contributions of the following SynapseML contributors:
Serena Ruan | Ric Serradas | Puneet Pruthi |
Serena is a Software Engineer II on the Synapse team in Beijing and a force of nature. In this release, Serena has continued her prolific contribution streak by adding language support for .NET, C#, and F# and integrating SynapseML with MLFlow. Additionally, Serena has contributed several features to the MLFlow and Spark.NET open-source communities so that these systems can work better for every user. These contributions are just some of the many amazing things Serena has accomplished during this release, and her devotion and craft are pivotal to the ecosystem. | Ric is a Senior Engineering Manager on the OneNote team with a shining personality and drive to collaborate. In just a few weeks, Ric hit the ground running by setting up an automated link between GitHub and Azure DevOps, building the first working version of Synapse E2E tests, and re-writing our entire build in GH Actions. Furthermore, Ric worked tirelessly through nights and weekends to land his contributions. | Puneet is a Senior Engineer on the SynapseML team with a knack for engineering systems and dockerization. Puneet's contributions to the library include architecting the new binder integration, driving our Synapse E2E tests to completion, and improving SynapseML’s infrastructure around community engagement. Puneet is constantly thinking of ways to improve the community and we value his effort. |
Mark Niehaus | Keerthi Yanda | Yagna Oruganti |
Mark is a Senior Software Engineer on the SynapseML team with a deep knowledge of the .NET ecosystem and infrastructure development. In this release, Mark architected SynapseML’s .NET binding blob publishing strategy, drove the OpenAI GPT-3 bindings to completion, and wrote a detailed GPT-3 walkthrough. Mark completed these projects while supporting the Time Series Insights service, speaking to his ability to keep multiple plates spinning at a time. | Keerthi is a Software Engineer II on the SynapseML team. Despite joining Microsoft just a few months ago, Keerthi has quickly learned the SynapseML ropes to take command of our integration with the Azure Synapse platform. Huge kudos to her for braving long build times and daunting error messages to make sure SynapseML works out of the box on Synapse Analytics clusters. | Yagna is a Senior Data and Applied Scientist on the Industry AI team with a talent for building solutions that integrate many community tools to solve customer challenges. Yagna's first contribution to SynapseML was a masterpiece of a demo showing how to use Isolation Forests, MLFlow, Tabular SHAP, and the interpret-ml explanation dashboard in a single anomaly detection example. |
We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML:
Serena Ruan @serena-ruan, Eric Dettinger, Scott Votaw @svotaw, Puneet Pruthi @ppruthi, Ric Serradas @riserrad, Mark Niehaus @niehaus59, Kyle Rush @k-rush, Keerthi Yanda @KeerthiYandaOS, Yagna Oruganti @YagnaDeepika, Jason Wang @memoryz, Ilya Matiach @imatiach-msft, Yazeed Alaudah @yalaudah, Elena Zherdeva @ezherdeva, Kashyap Patel @ms-kashyap, Martha Laguna @martthalch @marthalc, Alex Li @liyzcj, Maria Guirguis @maguir, Alexandra Savelieva @alsavelv, @netang, Sudhindra Kovalam @SudhindraKovalam, Markus Cozowicz @eisber, Tom Finley, Markus Weimer, Jeff Zheng, James Verbus @jverbus, Chris Hoder, Misha Desai, Nellie Gustafsson, Eren Orbey, Beverly Kodhek, Louise Han @jr-MS, Justyna Lucznik, Kim Manis, Mitrabhanu Mohanty, Bogdan Crivat, Anand Raman, William T. Freeman, James Montemagno, Luis Quintanilla, Dennis Kennedy, Ryan Hurey, Jarno Ensio, Brian Mouncer, Steve Suh @suhsteve, Akshaya Annavajhala (AK), Guolin Ke, Tara Grumm, Niharika Dutta @Niharikadutta, Andrew Fogarty, Juanyong Duan, Weichen Xu @WeichenXu123, Spark.NET Team, ONNX Team, Azure Global, Vowpal Wabbit Team, LightGBM Team, MSFT Garage Team, MSR Outreach Team, Speech SDK Team, MLflow Team
Visit our website for the latest docs, demos, and examples | Read more about SynapseML's GA release in the Microsoft Research Blog | Learn more about our .NET bindings and code generation system. |
Watch a demonstration of SynapseML to create a multilingual search engine. | Read our Paper from IEEE Big Data '21 | Explore our integration with the Azure OpenAI Service |
EnsembleByKey, Cacher, Timer; see the documentation.
Miniconda version 4.3.21, including Python 3.6.
CNTK version 2.1, using Maven Central.
Use OpenCV from the OpenPnP project from Maven Central.
Spark's binaryFiles function had a regression in version 2.1 from version 2.0 which would lead to performance issues; work around that for now. Data frame operations after a use of BinaryFileReader (e.g., reading images) are significantly faster with this.
The Spark installation is now patched with hadoop-azure and azure-storage.
Includes additional bug fixes and improvements.
We are now uploading MMLSpark as an Azure/mmlspark Spark package. Use --packages Azure:mmlspark:0.8 with the Spark command-line tools.
Add a bi-directional LSTM medical entity extractor to the ModelDownloader, and a new Jupyter notebook for medical entity extraction using NLTK, PubMed Word embeddings, and the Bi-LSTM.
Add ImageSetAugmenter for easy dataset augmentation within image processing pipelines.
Optimize the performance of CNTKModel. It now broadcasts a loaded model to workers and shares model weights between partitions on the same worker. Minibatch padding (an internal workaround for a CNTK bug) is no longer used, eliminating excess computation when there is a mismatch between the partition size and minibatch size.
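The cost of padding when the partition size and minibatch size do not line up can be illustrated with a little arithmetic. This is only a sketch in plain Python, not CNTKModel's internals: the helper names `minibatches` and `padded_row_count` are hypothetical, and the numbers are made up for illustration.

```python
def minibatches(rows, batch_size):
    """Yield consecutive slices of `rows`; the last slice may be shorter."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def padded_row_count(num_rows, batch_size):
    """Rows that would be evaluated if every batch were padded to full size."""
    num_batches = -(-num_rows // batch_size)  # ceiling division
    return num_batches * batch_size


# Hypothetical partition of 10 rows evaluated with a minibatch size of 4.
partition = list(range(10))
batches = list(minibatches(partition, batch_size=4))

# Without padding the last batch holds only the 2 leftover rows;
# with padding, 2 extra rows would have been evaluated and thrown away.
wasted = padded_row_count(len(partition), 4) - len(partition)
```

With many small partitions, those discarded rows add up, which is the excess computation the change avoids.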
Bugfix: CNTKModel can work with models with unnamed outputs.
Environment variables are now part of the docker image (in addition to being set in bash).
New docker images:
microsoft/mmlspark:latest: plain image, as always,
microsoft/mmlspark:gpu: GPU variant based on an nvidia/cuda image.
microsoft/mmlspark:plus and microsoft/mmlspark:plus-gpu: these images contain additional packages for internal use; they will probably be based on an older Conda version too in future releases.
The Conda environment now includes NLTK.
Updated Java and SBT versions.
Refactor ImageReader and BinaryFileReader to support streaming images, including a Python API. Also improved performance of the readers. Check the 302 notebook for a usage example.
Add ClassBalancer estimator for improving classification performance on highly imbalanced datasets.
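The general idea behind class balancing can be sketched in plain Python. This is an illustration only, not the ClassBalancer implementation: the function name `balance_weights` is hypothetical, and the weighting rule (largest class count divided by each row's class count) is an assumption chosen for the example.

```python
from collections import Counter


def balance_weights(labels):
    """Return one weight per row so that rarer classes count for more.

    Assumed rule (for illustration): weight = largest class count divided
    by the count of the row's own class.
    """
    if not labels:
        return []
    counts = Counter(labels)
    largest = max(counts.values())
    return [largest / counts[label] for label in labels]


# Hypothetical imbalanced label column: four "ok" rows and one "fraud" row.
labels = ["ok", "ok", "ok", "ok", "fraud"]
weights = balance_weights(labels)
```

A learner that accepts a per-row weight column can then use these values so the minority class is not drowned out during training.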
Create an infrastructure for automated fuzzing, serialization, and python wrapper tests.
Added a DropColumns pipeline stage.
ImageFeaturizer.
Enable streaming through the EnsembleByKey transformer.
ImageReader, HDFS issue, etc.
We now provide initial support for training on a GPU VM, and an ARM template to deploy an HDI Cluster with an associated GPU machine. See docs/gpu-setup.md for instructions on setting this up.
New auto-generated R wrappers for estimators and transformers. To import them into R, you can use devtools to import from the uploaded zip file. Tests and sample notebooks to come.
A new RenameColumn transformer for renaming columns within a pipeline.
Notebook 104: An experiment to demonstrate regression models to predict automobile prices. This notebook demonstrates the use of Pipeline stages, CleanMissingData, and ComputePerInstanceStatistics.
Notebook 105: Demonstrates DataConversion to make some columns Categorical.
There is a 401 notebook in notebooks/gpu which demonstrates CNTK training when using a GPU VM. (It is not shown with the rest of the notebooks yet.)
Local builds will always use a "0.0" version instead of a version based on the git repository. This should simplify the build process for developers and avoid hard-to-resolve update issues.
The TextPreprocessor transformer can be used to find and replace all key value pairs in an input map.
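Map-driven find and replace of this kind can be sketched in plain Python. This is a minimal illustration, not the TextPreprocessor API: the function name `replace_from_map` is hypothetical, and preferring the longest key when keys overlap is an assumption made for the example.

```python
import re


def replace_from_map(text, replacements):
    """Replace every occurrence of each key in `replacements` with its value.

    Assumption for this sketch: when keys overlap, the longest key wins.
    The text is scanned once, so replaced output is not re-scanned.
    """
    keys = [key for key in replacements if key]  # ignore empty keys
    if not keys:
        return text
    keys.sort(key=len, reverse=True)  # try longer keys first
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: replacements[match.group(0)], text)


# Hypothetical input map and sentence.
cleaned = replace_from_map("the cat sat", {"cat": "dog", "the": "a"})
```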
Fixed a regression in the image reader where zip files with images no longer displayed the full path to the image inside a zip file.
Additional minor bug and stability fixes.
TuneHyperparameters: parallel distributed randomized grid search for SparkML and TrainClassifier/TrainRegressor parameters. Sample notebook and python wrappers will be added in the near future.
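The core of a randomized search over a parameter grid can be sketched in a few lines of plain Python. This is a sequential, single-machine illustration, not the TuneHyperparameters API: `random_search`, the parameter names, and the toy scoring function are all hypothetical, and the parallel, distributed evaluation is left out.

```python
import random


def random_search(param_space, score_fn, num_runs, seed=0):
    """Sample `num_runs` random points from the grid and keep the best one."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    best_params, best_score = None, float("-inf")
    for _ in range(num_runs):
        # Pick one candidate value per parameter, independently.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid and a toy objective standing in for a validation metric.
space = {"max_depth": [2, 4, 8], "learning_rate": [0.01, 0.1, 0.3]}


def toy_score(params):
    return -abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)


best_params, best_score = random_search(space, toy_score, num_runs=20)
```

Because each sampled configuration is scored independently, the loop body is the part that lends itself to running in parallel across a cluster.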
Added PowerBIWriter for writing and streaming data frames to PowerBI.
Expanded image reading and writing capabilities, including using images with Spark Structured Streaming. Images can be read from and written to paths specified in a dataframe.
New functionality for convenient plotting in Python.
UDF transformer and additional UDFs.
Expanded pipeline support for arbitrary user code and libraries such as NLTK through UDFTransformer.
Refactored fuzzing system and added test coverage.
GPU training supports multiple VMs.
Updated to Conda 4.3.31, which comes with Python 3.6.3.
Also updated SBT and JVM.