ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
We're excited to announce the public release of ConvoKit 3.0!
The new version of ConvoKit now supports MongoDB as a backend choice for working with corpus data. This update provides several benefits, such as taking advantage of MongoDB's lazy loading to handle extremely large corpora, and ensuring resilience to unexpected crashes by continuously writing all changes to the database.
To learn more about using MongoDB as a backend choice, refer to our documentation at https://convokit.cornell.edu/documentation/storage_options.html.
Historically, ConvoKit has allowed you to work with conversational data directly in program memory through the Corpus class, with long-term storage provided by dumping the contents of a Corpus to disk in JSON format. This paradigm works well for distributing and storing static datasets, and for workloads that compute over some or all of the data in a short period and optionally store the results on disk. For example, the datasets ConvoKit distributes with the library are stored in JSON format, and you can load them into program memory to explore and compute with them.
In ConvoKit version 3.0.0, we introduce a new option for working with conversational data: the MongoDB backend. Consider a use case where you want to collect conversational data over a long period of time and maintain a persistent representation of the dataset even if your data collection program unexpectedly crashes. With the memory backend, this would mean regularly dumping your corpus to JSON files, incurring repeated expensive write operations. With the new database backend, all your data is automatically persisted in the database as it is added to the corpus.
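Concretely, the backend is chosen when a corpus is constructed or via ConvoKit's configuration file. As a minimal sketch (the file location and key names below are taken from the ConvoKit storage documentation; treat them as assumptions to verify against your installed version), a global configuration enabling the database backend might look like:

```yaml
# ~/.convokit/config.yml (sketch; location and key names assumed from the docs)
# Backend used for new corpora: "mem" (in-memory) or "db" (MongoDB)
default_backend: db
# Address of the MongoDB server backing "db"-mode corpora
db_host: localhost:27017
# Directory where downloaded and saved corpora live on disk
data_directory: ~/.convokit/saved-corpora
```

Individual corpora can also override the global default at construction time, e.g. via a backend argument to the Corpus constructor, as described in the storage documentation linked above.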
Please refer to this database setup document to set up a MongoDB database, and to this storage document for a further explanation of how the database backend option works.
Updated tests to include db_mode testing.
Updated examples to include demonstration of db_mode usage.
Added:
- __init__ in model/corpus.py now takes parameters for DB functionality. #175
- model/backendMapper, which separates memory and DB transactions. #175

Changed:
- Modified ConvoKit.Metadata to disallow any mutability of metadata fields, implemented by returning a deepcopy of the metadata field storage every time a field is accessed. This aligns behavior between the memory and DB modes. #197
- Corpus items should not access their private variables and should instead use the public "getters"; changed coordination.py accordingly in its usage of metadata mutability. #197

Fixed:
- In coordination.py, pair_mode set to maximize caused the pairing function to return an integer, leading to an error when pairing objects. #197
- corpus.utterances threw an error in politenessAPI; it should call corpus.iter_utterances() instead. #170

Python Version Requirement Update:
The v2.5.2 release adds support for Chinese politeness strategy extraction. Currently, ConvoKit's politenessStrategies supports three politeness strategy collections covering two languages.
The v2.5.3 release fixes a minor bug that occurs when using TextParser with spaCy > 3.2.0.
This release includes a new method, from_pandas, in the Corpus class that should simplify the Corpus creation process.
It generates a ConvoKit corpus from pandas DataFrames of speakers, utterances, and conversations.
A notebook demonstrating the use of this method can be found here.
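As a sketch of the inputs involved (the column names and call signature below follow the ConvoKit documentation for from_pandas; treat them as assumptions to check against your installed version), the dataframes might be prepared like this:

```python
import pandas as pd

# Minimal dataframes in the shape Corpus.from_pandas expects.
# Column names ("speaker", "conversation_id", "reply_to", "timestamp",
# "text") are taken from the ConvoKit documentation; verify against
# your installed version.
utterances_df = pd.DataFrame([
    {"id": "u0", "speaker": "alice", "conversation_id": "u0",
     "reply_to": None, "timestamp": 0, "text": "Hello!"},
    {"id": "u1", "speaker": "bob", "conversation_id": "u0",
     "reply_to": "u0", "timestamp": 1, "text": "Hi there."},
]).set_index("id")
speakers_df = pd.DataFrame([{"id": "alice"}, {"id": "bob"}]).set_index("id")
conversations_df = pd.DataFrame([{"id": "u0"}]).set_index("id")

# With ConvoKit installed, the corpus would then be built with:
#   from convokit import Corpus
#   corpus = Corpus.from_pandas(utterances_df, speakers_df, conversations_df)
```

Here each conversation is identified by the id of its root utterance, and reply_to threads the utterances into a conversation tree.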
This release contains an implementation of the Expected Conversational Context Framework, and associated demos.
This release describes changes that have been implemented as part of the v2.4 release.
Vectors and matrices now get first-class treatment in ConvoKit. Vector data can now be stored in a ConvoKitMatrix object that is integrated with the Corpus and its objects, allowing for straightforward access from Corpus component objects, user-friendly display of vector data, and more. Read our introduction to vectors for more details.
Accordingly, we have re-implemented the relevant Transformers that were already using array or vector-like data to take advantage of this new data structure, namely:
The last two Transformers can now be used for any general vector data, as opposed to just bag-of-words vector data.
We have implemented a formal way to delete metadata attributes from Corpus component objects. Previously, metadata attributes were deleted from objects individually, which could lead to inconsistencies between the ConvoKitIndex (which tracks what metadata attributes currently exist) and the Corpus component objects. To rectify this, we now disallow deletion of metadata attributes from individual objects; such deletion should instead be carried out using the Corpus method delete_metadata().
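The consistency issue can be illustrated with a toy stand-in (this is not ConvoKit's actual implementation, just a sketch of the idea):

```python
# Toy stand-in for the index-consistency problem (NOT ConvoKit's actual
# classes): an index tracks which metadata attributes exist, so deletion
# must update both the index and every object's metadata together.
class ToyCorpus:
    def __init__(self):
        self.index = set()   # attribute names the index believes exist
        self.objects = []    # component objects, each with a metadata dict

    def add_meta(self, obj, key, value):
        obj[key] = value
        self.index.add(key)

    def delete_metadata(self, key):
        # Centralized deletion keeps the index and the objects in sync.
        for obj in self.objects:
            obj.pop(key, None)
        self.index.discard(key)

corpus = ToyCorpus()
utt = {}
corpus.objects.append(utt)
corpus.add_meta(utt, "politeness", 0.9)

# Deleting per-object (del utt["politeness"]) would leave "politeness"
# dangling in corpus.index; the centralized method removes both.
corpus.delete_metadata("politeness")
assert "politeness" not in utt and "politeness" not in corpus.index
```

This mirrors why per-object deletion is disallowed: only a corpus-level operation can keep the shared index accurate.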
- Support for text_func values for the three main component types: utterance, speaker, and conversation.
- corpus.iterate_by() is now deprecated.
- Changes to fit and transform calls.
- index.json will be updated during dumps of any currently existing corpora, but there will be no compatibility issues with loading from existing corpora.

This release describes changes that have happened since the v2.3 release, and includes changes from both v2.3.1 and v2.3.2.
- Utterance.root has been renamed to Utterance.conversation_id.
- User has been renamed to Speaker. Functions with 'user' in the name have been renamed accordingly.
- User.name has been renamed to Speaker.id.

(Backwards compatibility will be maintained for all the deprecated attributes and functions.)

- The Corpus can now generate pandas DataFrames for its internal components using get_conversations_dataframe(), get_utterances_dataframe(), and get_speakers_dataframe().
- Conversation objects have a get_chronological_speaker_list() method for getting a chronological list of conversation participants.
- Conversation's print_conversation_structure() method has a new argument, limit, for limiting the number of utterances displayed to the number specified in limit.
- Added an invalid_val argument for HyperConvo that automatically replaces NaN values with the default value specified in invalid_val.
- FightingWords.summarize() now provides labelled plots.
- Fixed a bug in download() when downloading Reddit corpora.
- Fixed bugs in HyperConvo that were causing NaN warnings and incorrect calculations, as well as a minor bug that was causing HyperConvo annotations to not be JSON-serializable.
- Fixed a bug in Classifier and BoWClassifier that was causing inconsistent behaviour for compressed vs. uncompressed vector metadata.

Some Transformers now have a summarize() function that summarizes the annotated corpus (i.e., a corpus annotated by a transform() call) in a way that gives the user a high-level view / interpretation of the annotated metadata.
We introduce several new Transformers: Classifier, Bag-of-Words Classifier, Ranker, Pairer, Paired Prediction, Paired Bag-of-Words Prediction, Fighting Words, and (Conversational) Forecaster (with variants: Bag-of-Words and CRAFT).
We introduce TextCleaner, which does text cleaning for online text data. This cleaner depends on the clean-text package.
Tree operations
Updates to various parts of ConvoKit:
Added support for creating Transformers that compute utterance attributes. Also updated support for dependency-parsing text. An example of how this new functionality can be used is found here.
Added some functionality to
Updated the code used to compute prompt types and phrasing motifs, deprecating the old QuestionTypology module. An example of how the updated code is used can be found here and here.
Updated code used to compute linguistic divergence.
Added support for pipelining, and some limited support for computing per-utterance attributes.
This is the public release of the brand-new, overhauled ConvoKit API, marking a major version number bump to 2.0.
Compared to previous releases, the newly refactored API has been heavily streamlined to unite all conversational analysis modules under a single consistent interface, which should hopefully decrease the learning curve for the toolkit. The new API is inspired by scikit-learn and should be familiar to those who have prior experience with that package. A high-level explanation of the API and object model can be found here along with a step-by-step tutorial for getting started programming with ConvoKit.
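As a toy illustration of that interface (a pure-Python stand-in, not ConvoKit's actual classes): a Transformer-style object is first fit on a corpus and then annotates it via transform(), just as in scikit-learn.

```python
# Toy stand-in (not ConvoKit's actual classes) for the scikit-learn-style
# interface: fit() optionally learns from the corpus, transform() annotates it.
class WordCountAnnotator:
    def fit(self, corpus):
        # Nothing to learn for word counts; real transformers may
        # estimate parameters here.
        return self

    def transform(self, corpus):
        # Annotate each utterance with its word count.
        for utt in corpus:
            utt["meta"] = {"word_count": len(utt["text"].split())}
        return corpus

corpus = [{"text": "hello there"}, {"text": "nice to meet you"}]
corpus = WordCountAnnotator().fit(corpus).transform(corpus)
# corpus[1]["meta"]["word_count"] is now 4
```

The payoff of the shared interface is composability: any annotator that follows the fit/transform convention can be chained with others in a pipeline.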