Textshape Versions Save

Tools for reshaping text data

v0.6.0

5 years ago

NEWS

Versioning

Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

  • Breaking backward compatibility bumps the major (and resets the minor and patch)
  • New additions without breaking backward compatibility bumps the minor (and resets the patch)
  • Bug fixes and misc changes bumps the patch

textshape 1.5.1 - 1.6.0

BUG FIXES

  • split_match_regex_to_transcript gave warnings about perl = TRUE that were due to gsub with a fixed = TRUE clashing with the perl = TRUE.

NEW FEATURES

  • flatten added for flattening nested, named lists into single tiered lists using the concatenated list/atomic vector names as the names of the single tiered list.

  • unnest_text added to located and un-nest nested text columns in a data.frame.

IMPROVEMENTS

  • tidy_dtm/tidy_tdm did not order unnamed matrices as expected (e.g., {1, 2, ..., 1} was ordered as {1, 10, 2, ...}). This has been corrected.

textshape 1.4.0 - 1.5.0

BUG FIXES

  • split_sentence, split_word, split_token, & split_speaker` did not handle single row data.frames properly resulting in loss of data. This has been fixed.

NEW FEATURES

  • split_sentence_token added as a shortcut to split into sentences, add a sentence index, and then split into tokens and add a token index.

  • tidy_matrix and tidy_adjacency_matrix added to provide easy tidy representations of these data types.

  • cluster_matrix added for reordering the columns/rows of matrices via hierarchical clustering.

IMPROVEMENTS

  • split_sentence now handles digit(s) + inch (in.) abbreviation if not followed by a capital letter. Previously, this was split on. Additionally, post script (p.s.) is no longer split on.

textshape 1.1.0 - 1.3.0

BUG FIXES

  • tidy_list did not add content.attribute.name for lists of named vectors.

MINOR FEATURES

  • split_match_regex added as a version of split_match with regex = TRUE by default. This makes it easier to reason about what the function call is doing.

  • split_match_regex_to_transcript added to directly split a text by a person regex and convert to a two column transcript of person and dialogue.

IMPROVEMENTS

  • tidy_list now uses data.table's rbind for lists of data.frames. This means column ordering does not need to match and missing columns are automatically filled with NAs.

  • split_sentence has better handling for the 'No.' abbreviation that distinguishes between 'No.' followed by digits (assumed to be and abbreviation) and when no digits follow (assumed to be a complete sentence).

  • split_sentence has better handling for quoted material (i.e., a punctuation mark followed by single or double quotes that is not followed by a comma).

  • split_sentence has better handling for single and double middle names presented as initials.

  • split_sentence has better handling for abbreviated English units of measure.

CHANGES

  • combine.default included element names by default. This has been removed to include only the elements.

textshape 1.0.2

BUG FIXES

  • tidy_list with a list of unnamed data.frames resulted in an error (see issue #7). This issue has been fixed.

  • split_word.data.frame and split_token.data.frame both used an incorrect column naming of sentence_id for word and token index respectively. These columns are now renamed to word_id and token_id respectively.

  • split_token gets a more robust splitting algorithm.

NEW FEATURES

  • column_to_rownames added to enable one to quickly add a column as rownames easily within a pipeline. This is useful when turning a data.frame into a matrix.

  • tidy_list picks up the ability to tidy a list of named vectors into three columns.

CHANGES

  • as.tibble removed from all function arguments. This was a nice interactive feature that made programming very difficult to reason about. Having an environment dependant output would result in no adoption of the textshape package as a dependency. Additionally, set_output and tibble_output, two complementary function have been removed without being deprecated. The problem was so egregious and the package infant enough, that removal without deprecation was warranted.

textshape 1.0.1

NEW FEATURES

  • Users can now globally select a tibble output rather than a data.table output for all functions that outputted a data.table. This can be set globally via set_output. If the user does not set the output type textshape tries to infer based on whether or not the user has dplyr loaded. If dplyr is loaded then tibble is the default output.

  • set_output and tibble_output added to globally set the output type (tibble or data.table) and to check/infer the desired output type.

textshape 1.0.0

CHANGES

  • bind_list, bind_table, & bind_vector have been renamed to the more meaningful forms of tidy_list, tidy_table, & tidy_vector. The former version are now deprecated. This bumps the version to 1.0.0 as this is a major change that breaks backward compatibility.

textshape 0.1.0 - 0.2.0

NEW FEATURES

  • bind_list added to rbind a list of named data.frames or vectors.

  • split_transcript added to split a transcript style vector (e.g., c("greg: Who me", "sarah: yes you!") into a name and dialogue vector that is coerced to a data.table.

  • change_index added for extracting the indices of changes in runs within an atomic vector. Pairs well with split_index.

  • bind_vector added to cbind a named atomic vector's names and values.

  • bind_table added to cbind a table's names and values.

  • duration method for numeric vectors added as well as a starts and ends function for calculating start and end times from a numeric vector.

  • from_to added to prepare speaker data for a network lot given the flowing nature of discourse.

  • tidy_dtm & tidy_tdm added to convert a DocumentTermMatrix or TermDocumentMatrix into a tidied data.frame.

  • tidy_colo_dtm & tidy_colo_tdm added to convert a DocumentTermMatrix or TermDocumentMatrix into a collocation matrix and then a tidied data.frame.

  • unique_pairs added to compliment the output of tidy_colo_dtm & tidy_colo_tdm. Enables the removal of duplicated collocating pairs caused by symmetrical mirroring of the upper and lower triangle of the collocation matrix.

CHANGES

  • split_index now uses change_index(x) as the default when x is an atomic vector.

textshape 0.0.1

Tools that can be used to reshape text data.

v1.0.2

7 years ago

NEWS

Versioning

Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

  • Breaking backward compatibility bumps the major (and resets the minor and patch)
  • New additions without breaking backward compatibility bumps the minor (and resets the patch)
  • Bug fixes and misc changes bumps the patch

textshape 1.0.2

BUG FIXES

  • tidy_list with a list of unnamed data.frames resulted in an error (see issue #7). This issue has been fixed.
  • split_word.data.frame and split_token.data.frame both used an incorrect column naming of sentence_id for word and token index respectively. These columns are now renamed to word_id and token_id respectively.
  • split_token gets a more robust splitting algorithm.

NEW FEATURES

  • column_to_rownames added to enable one to quickly add a column as rownames easily within a pipeline. This is useful when turning a data.frame into a matrix.
  • tidy_list picks up the ability to tidy a list of named vectors into three columns.

CHANGES

  • as.tibble removed from all function arguments. This was a nice interactive feature that made programming very difficult to reason about. Having an environment dependant output would result in no adoption of the textshape package as a dependency. Additionally, set_output and tibble_output, two complementary function have been removed without being deprecated. The problem was so egregious and the package infant enough, that removal without deprecation was warranted.

textshape 1.0.1

NEW FEATURES

  • Users can now globally select a tibble output rather than a data.table output for all functions that outputted a data.table. This can be set globally via set_output. If the user does not set the output type textshape tries to infer based on whether or not the user has dplyr loaded. If dplyr is loaded then tibble is the default output.
  • set_output and tibble_output added to globally set the output type (tibble or data.table) and to check/infer the desired output type.

textshape 1.0.0

CHANGES

  • bind_list, bind_table, & bind_vector have been renamed to the more meaningful forms of tidy_list, tidy_table, & tidy_vector. The former version are now deprecated. This bumps the version to 1.0.0 as this is a major change that breaks backward compatibility.

textshape 0.1.0 - 0.2.0

NEW FEATURES

  • bind_list added to rbind a list of named data.frames or vectors.
  • split_transcript added to split a transcript style vector (e.g., c("greg: Who me", "sarah: yes you!") into a name and dialogue vector that is coerced to a data.table.
  • change_index added for extracting the indices of changes in runs within an atomic vector. Pairs well with split_index.
  • bind_vector added to cbind a named atomic vector's names and values.
  • bind_table added to cbind a table's names and values.
  • duration method for numeric vectors added as well as a starts and ends function for calculating start and end times from a numeric vector.
  • from_to added to prepare speaker data for a network lot given the flowing nature of discourse.
  • tidy_dtm & tidy_tdm added to convert a DocumentTermMatrix or TermDocumentMatrix into a tidied data.frame.
  • tidy_colo_dtm & tidy_colo_tdm added to convert a DocumentTermMatrix or TermDocumentMatrix into a collocation matrix and then a tidied data.frame.
  • unique_pairs added to compliment the output of tidy_colo_dtm & tidy_colo_tdm. Enables the removal of duplicated collocating pairs caused by symmetrical mirroring of the upper and lower triangle of the collocation matrix.

CHANGES

  • split_index now uses change_index(x) as the default when x is an atomic vector.

textshape 0.0.1

Tools that can be used to reshape text data.

v0.2.0

7 years ago

NEWS

Versioning

Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

  • Breaking backward compatibility bumps the major (and resets the minor and patch)
  • New additions without breaking backward compatibility bumps the minor (and resets the patch)
  • Bug fixes and misc changes bumps the patch

textshape 0.1.0 - 0.2.0

NEW FEATURES

  • bind_list added to rbind a list of named data.frames or vectors.
  • split_transcript added to split a transcript style vector (e.g., c("greg: Who me", "sarah: yes you!") into a name and dialogue vector that is coerced to a data.table.
  • change_index added for extracting the indices of changes in runs within an atomic vector. Pairs well with split_index.
  • bind_vector added to cbind a named atomic vector's names and values.
  • bind_table added to cbind a table's names and values.
  • duration method for numeric vectors added as well as a starts and ends function for calculating start and end times from a numeric vector.
  • from_to added to prepare speaker data for a network lot given the flowing nature of discourse.
  • tidy_dtm & tidy_tdm added to convert a DocumentTermMatrix or TermDocumentMatrix into a tidied data.frame.
  • tidy_colo_dtm & tidy_colo_tdm added to convert a DocumentTermMatrix or TermDocumentMatrix into a collocation matrix and then a tidied data.frame.
  • unique_pairs added to compliment the output of tidy_colo_dtm & tidy_colo_tdm. Enables the removal of duplicated collocating pairs caused by symmetrical mirroring of the upper and lower triangle of the collocation matrix.

CHANGES

  • split_index now uses change_index(x) as the default when x is an atomic vector.

textshape 0.0.1

Tools that can be used to reshape text data.