Tidy data structures, summaries, and visualisations for missing data
Added all_miss() / all_na(), equivalent to all(is.na(x))
Added any_complete(), equivalent to all(complete.cases(x))
Added any_miss(), equivalent to anyNA(x)
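As a rough illustration of the equivalences listed above, here is a base R sketch. The `*_sketch` function names are hypothetical stand-ins written for this example; they are not naniar's own source code.

```r
# Hypothetical stand-ins built directly from the stated equivalences.
all_miss_sketch <- function(x) all(is.na(x))  # every value is missing
any_miss_sketch <- function(x) anyNA(x)       # at least one value is missing

x <- c(1, NA, 3)

all_miss_sketch(x)          # FALSE: only one of three values is NA
all_miss_sketch(c(NA, NA))  # TRUE
any_miss_sketch(x)          # TRUE
any_miss_sketch(c(1, 2))    # FALSE
```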
Added common_na_numbers and finalised common_na_strings to provide a list of commonly used NA values #168
Added miss_var_which, which lists the variable names with missings
Added as_shadow_upset, which gets the data into a format suitable for plotting as an UpSetR plot:
airquality %>%
as_shadow_upset() %>%
UpSetR::upset()
Added some imputation functions to assist with exploring missingness structure and visualisation:
impute_below performs as for shadow_shift, but operates on all columns. This means that it imputes missing values 10% below the range of the data (powered by shadow_shift), to facilitate graphical exploration of the data. Closes #145. There are also scoped variants that work for specific named columns: impute_below_at, and for columns that satisfy some predicate function: impute_below_if.
impute_mean imputes the mean value, and has scoped variants impute_mean_at and impute_mean_if.
impute_below and shadow_shift gain arguments prop_below and jitter to control the degree of shift, and also the extent of jitter.
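To make the "10% below the range" idea concrete, here is a minimal base R sketch. The function name `impute_below_sketch` is hypothetical, and the formula (minimum minus `prop_below` times the range) is an assumption about the general approach; it omits the jitter step entirely and is not naniar's implementation.

```r
# Hypothetical sketch: replace NA with a value prop_below * range
# below the observed minimum. No jitter is applied here.
impute_below_sketch <- function(x, prop_below = 0.1) {
  rng <- range(x, na.rm = TRUE)
  shifted_value <- rng[1] - prop_below * diff(rng)
  x[is.na(x)] <- shifted_value
  x
}

x <- c(10, NA, 20)
imputed <- impute_below_sketch(x)
imputed  # 10 9 20: the NA sits 10% of the range (10) below the minimum (10)
```

Because the imputed values fall outside the observed range, they show up as a separate band when plotted, which is what makes this useful for graphical exploration.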
Added complete_{case/var}_{pct/prop}, which complement miss_{var/case}_{pct/prop} #150
Added unbind_shadow and unbind_data as helpers to remove shadow columns from data, and data from shadows, respectively.
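The idea of separating data columns from shadow columns can be sketched in base R. This example assumes that shadow columns carry an "_NA" suffix and hold the levels "!NA" and "NA"; both conventions, and all object names below, are assumptions made for illustration rather than a description of naniar's internals.

```r
df <- data.frame(x = c(1, NA), y = c("a", NA))

# Build a shadow: one factor column per variable marking missingness.
shadow <- as.data.frame(lapply(df, function(col) {
  factor(ifelse(is.na(col), "NA", "!NA"), levels = c("!NA", "NA"))
}))
names(shadow) <- paste0(names(df), "_NA")

# Bind data and shadow side by side.
bound <- cbind(df, shadow)

# Split them apart again by the assumed suffix.
is_shadow_col <- grepl("_NA$", names(bound))
data_only   <- bound[, !is_shadow_col, drop = FALSE]  # shadow columns removed
shadow_only <- bound[, is_shadow_col, drop = FALSE]   # data columns removed
```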
Added is_shadow and are_shadow to determine if something contains a shadow column. Similar to rlang::is_na and rlang::are_na, is_shadow returns a logical vector of length 1, and are_shadow returns a logical vector whose length is the number of names of a data.frame. This might be revisited at a later point (see any_shade in add_label_shadow).
Aesthetics now map as expected in geom_miss_point(). This means you can write things like geom_miss_point(aes(colour = Month)) and it works appropriately. Fixed by Luke Smith in Pull request #144, fixing #137.
miss_var_summary and miss_case_summary now use order = TRUE by default, so cases and variables with the most missings are presented in descending order. Fixes #163
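The effect of ordering by the number of missings can be illustrated with a small base R sketch on the built-in airquality data. The object and column names here are chosen for this example only and do not reproduce the exact output format of miss_var_summary.

```r
# Count missings per variable, then sort so the most-missing come first.
n_miss <- colSums(is.na(airquality))

var_summary <- data.frame(
  variable = names(n_miss),
  n_miss   = unname(n_miss),
  pct_miss = unname(n_miss) / nrow(airquality) * 100
)

var_summary <- var_summary[order(-var_summary$n_miss), ]
head(var_summary, 2)  # Ozone (37 missing) first, then Solar.R (7 missing)
```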
Changes for Visualisation:
gg_miss_case and gg_miss_var to lorikeet purple (from ochRe package: https://github.com/ropenscilabs/ochRe)
gg_miss_case
order_cases = TRUE.
show_pct option to be consistent with gg_miss_var #153
gg_miss_which is rotated 90 degrees so it is easier to read variable names
gg_miss_fct uses a minimal theme and tilts the axis labels #118.
Imported is_na and are_na from rlang.
Added common_na_strings, a list of common NA values #168.
Added some detail on alternative methods for replacing with NA in the vignette "replacing values with NA".
"The Founding of naniar"
The first version on CRAN! The name is taken from Chapter 9 of The Magician's Nephew. Below is the updated NEWS file.
=========================
naniar onto CRAN, updates to naniar will happen reasonably regularly after this, approximately every 1-2 months
=========================
naniar
miss_case_cumsum / miss_var_cumsum / replace_to_na
gg_var_cumsum & gg_case_cumsum
group_by is now respected by the following functions:
miss_case_cumsum()
miss_case_summary()
miss_case_table()
miss_prop_summary()
miss_var_cumsum()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
label_missing* to label_miss to be more consistent with the rest of naniar
pct and prop helpers (#78)
miss_df_pct - this was literally the same as pct_miss or prop_miss.
gg_miss_var gets a show_pct argument to show the percentage of missing values (Thanks Jennifer for the helpful feedback! :))
miss_var_summary & miss_case_summary now have consistent output (one was ordered by n_missing, not the other).
miss_case_pct
enquo_x is now x (as advised by Hadley)
=========================
replace_to_na is a complement to tidyr::replace_na and replaces a specified value from a variable to NA.
gg_miss_fct returns a heatmap of the number of missings per variable for each level of a factor. This feature was very kindly contributed by Colin Fay.
gg_miss_ functions now return a ggplot object, which behaves as such. gg_miss_ basic themes can be overridden with ggplot functions. This fix was very kindly contributed by Colin Fay.
add_* functions handle bare unquoted names where appropriate as per #61
add_* family
geom_missing_point() to geom_miss_point(), to keep consistent with the rest of the functions in naniar.
=========================
brfss and tao as per #59
=========================
add_label_missings()
add_label_shadow()
cast_shadow()
cast_shadow_shift()
cast_shadow_shift_label()
added github issue / contribution / pull request guides
ts generic functions are now miss_var_span and miss_var_run, and gg_miss_span, and work on data.frames, as opposed to just ts objects.
add_shadow_shift() adds a column of shadow_shifted values to the current dataframe, adding "_shift" as a suffix
cast_shadow() - acts like bind_shadow() but allows for specifying which columns to add
shadow_shift now has a method for factors - powered by forcats::fct_explicit_na() #3
is_na function to label_na
tidy-miss-[topic]
gg_missing_* is changed to gg_miss_* to fit with other syntax
miss_cat, shadow_df and shadow_cat, as they are no longer needed, and have been superseded by label_missing_2d, as_shadow, and is_na.
pedestrian - contains hourly counts of pedestrians
miss_ts_run(): return the number of missings / complete in a single run
miss_ts_summary(): return the number of missings in a given time period
gg_miss_ts(): plot the number of missings in a given time period
naniar to narnia - I had to explain the spelling a few times when I was introducing the package and I realised that I should change the name. Fortunately it isn't on CRAN yet.
=========================
prop_miss and the complement prop_complete. Where n_miss returns the number of missing values, prop_miss returns the proportion of missing values. Likewise, prop_complete returns the proportion of complete values.
The left hand side functions have been made defunct in favour of the right hand side.
- percent_missing_case() --> miss_case_pct()
- percent_missing_var() --> miss_var_pct()
- percent_missing_df() --> miss_df_pct()
- summary_missing_case() --> miss_case_summary()
- summary_missing_var() --> miss_var_summary()
- table_missing_case() --> miss_case_table()
- table_missing_var() --> miss_var_table()
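The relationship between counts and proportions of missing values, as described for n_miss, prop_miss, and prop_complete before the list above, can be sketched in base R. The `*_sketch` names are hypothetical and only illustrate the arithmetic; they are not naniar's source.

```r
# Hypothetical stand-ins showing how count and proportion relate.
n_miss_sketch        <- function(x) sum(is.na(x))
prop_miss_sketch     <- function(x) mean(is.na(x))
prop_complete_sketch <- function(x) mean(!is.na(x))

x <- c(1, NA, 3, NA)

n_miss_sketch(x)         # 2
prop_miss_sketch(x)      # 0.5
prop_complete_sketch(x)  # 0.5; the two proportions always sum to 1
```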
=========================
miss_* = I want to explore missing values
miss_case_* = I want to explore missing cases
miss_case_pct = I want to find the percentage of cases containing a missing value
miss_case_summary = I want to find the number / percentage of missings in each case
miss_case_table = I want a tabulation of the number / percentage of cases missing
This is more consistent and easier to reason with.
Thus, I have renamed the following functions:
- percent_missing_case() --> miss_case_pct()
- percent_missing_var() --> miss_var_pct()
- percent_missing_df() --> miss_df_pct()
- summary_missing_case() --> miss_case_summary()
- summary_missing_var() --> miss_var_summary()
- table_missing_case() --> miss_case_table()
- table_missing_var() --> miss_var_table()
These will be made defunct in the next release, 0.0.6.9000 ("The Wood Between Worlds").
=========================
n_complete is a complement to n_miss, and counts the number of complete values in a vector, matrix, or dataframe.
shadow_shift now handles cases where there is only 1 complete value in a vector.
testthat.
=========================
After a burst of effort on this package I have done some refactoring and thought hard about where this package is going to go. This meant that I had to make the decision to rename the package from ggmissing to naniar. The name may strike you as strange but it reflects the fact that there are many changes happening, and that we will be working on creating a nice utopia (like Narnia by CS Lewis) that helps us make it easier to work with missing data
add_n_miss and add_prop_miss are helpers that add columns to a dataframe containing the number and proportion of missing values. An example has been provided to use decision trees to explore missing data structure as in Tierney et al
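A base R sketch of adding per-row missingness columns is shown below. The column names n_miss_all and prop_miss_all are assumptions chosen for this example, and the code is an illustration of the idea rather than the package's implementation.

```r
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

# Compute the missingness matrix once, before any new columns are added,
# so the new columns do not affect the counts.
miss_mat <- is.na(df)

df$n_miss_all    <- rowSums(miss_mat)   # number of missing values per row
df$prop_miss_all <- rowMeans(miss_mat)  # proportion of missing values per row

df
```

Columns like these can then be used as a response in a model, such as a decision tree, to explore which variables are associated with missingness.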
geom_miss_point() now supports transparency, thanks to @seasmith (Luke Smith)
more shadows. These are mainly around bind_shadow and gather_shadow, which are helper functions to assist with creating
geom_missing_point() broke after the new release of ggplot2 2.2.0, but this is now fixed by ensuring that it inherits from GeomPoint, rather than just a new Geom. Thanks to Mitchell O'Hara-Wild for his help with this.
missing data summaries table_missing_var and table_missing_case also now return more sensible numbers and variable names. It is possible these function names will change in the future, as these are kind of verbose.
semantic versioning was incorrectly entered in the DESCRIPTION file as 0.2.9000, so I changed it to 0.0.2.9000, and then to 0.0.3.9000 now to indicate the new changes, hopefully this won't come back to bite me later. I think I accidentally did this with visdat at some point as well. Live and learn.
gathered related functions into single R files rather than leaving them in their own.
correctly imported the %>% operator from magrittr, and removed a lot of chaff around @importFrom - really don't need to use @importFrom that often.
=========================
geom_missing_point() now works in a way that we expect! Thanks to Miles McBain for working out how to get this to work.
=========================
percent_missing_df returns the percentage of missing data for a data.frame
percent_missing_var the percentage of variables that contain missing values
percent_missing_case the percentage of cases that contain missing values.
table_missing_var table of missing information for variables
table_missing_case table of missing information for cases
summary_missing_var summary of missing information for variables (counts, percentages)
summary_missing_case summary of missing information for cases (counts, percentages)