factor_nosort {jwutil} | R Documentation |
This function generates factors more quickly, without leveraging fastmatch. The speed increase with fastmatch for ICD-9 codes was about 33 using Rcpp, and a hashed matching algorithm.
factor_nosort(x, levels = NULL, labels = levels)
x | An object of atomic type. |
levels | An optional character vector of levels. Is coerced to the same type as x. |
labels | A set of labels used to rename the levels, if desired. |
NaNs are converted to NA when used on numeric values. Extracted from https://github.com/kevinushey/Kmisc.git
These features from base R are missing: exclude = NA, ordered = is.ordered(x), nmax = NA
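To make the behaviour described above concrete, here is a minimal base R sketch of a factor constructor that does not sort its levels. The name factor_nosort_sketch is hypothetical and the body is an assumption for illustration only: it uses match() rather than the compiled code in the package, it does not coerce levels to the type of x, and, like the function documented here, it has no exclude, ordered or nmax arguments.

```r
# Hypothetical illustration, not the package implementation.
factor_nosort_sketch <- function(x, levels = NULL, labels = levels) {
  if (is.null(levels)) {
    # Levels in order of first appearance; is.na() is TRUE for both NA and
    # NaN, so neither becomes a level.
    levels <- unique.default(x)
    levels <- levels[!is.na(levels)]
  }
  # match() gives NA_integer_ for any value not found in levels, so NaN
  # entries in x end up as NA in the result.
  codes <- match(x, levels)
  # labels defaults to levels and is only evaluated here, after levels is final.
  structure(codes, levels = as.character(labels), class = "factor")
}

f_chr <- factor_nosort_sketch(c("b", "a", "b"))
f_num <- factor_nosort_sketch(c(2, NaN, 1))
levels(f_chr)  # first-appearance order, not sorted
is.na(f_num)   # the NaN entry is reported as missing
```

Relying on match() keeps the sketch short; a hashed lookup in compiled code would follow the same logic.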
I don't think there is any requirement for factor levels to be sorted in advance, especially not for ICD-9 codes where a simple alphanumeric sorting will likely be completely wrong.
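The point about level order can be checked with base R alone. In the sketch below the code values are made up for illustration and are not taken from any data set. Calling factor() without levels sorts the unique values, whereas passing levels = unique(x) keeps them in order of first appearance; the integer codes differ, but the values they decode to are the same.

```r
# Made-up, ICD-9-like strings; purely illustrative.
codes <- c("V10", "100", "E800", "0010", "100")

f_sorted <- factor(codes)                         # levels are sorted
f_first  <- factor(codes, levels = unique(codes)) # levels in first-seen order

levels(f_first)
# Both factors decode back to the same character vector.
identical(as.character(f_sorted), as.character(f_first))
```

Because the sort order of character strings depends on the locale, the levels of f_sorted are deliberately not listed here.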
Kevin Ushey, adapted by Jack Wasey
## Not run: 
pts <- icd:::random_unordered_patients(1e7)
u <- unique.default(pts$code)
# this shows that stringr (which uses stringi) sort takes 50% longer than
# built-in R sort.
microbenchmark::microbenchmark(sort(u), stringr::str_sort(u))
# this shows that factor_ is about 50% faster than factor for
# big vectors of strings
# without sorting is much faster:
microbenchmark::microbenchmark(factor(pts$code),
  # factor_(pts$code),
  factor_nosort(pts$code),
  times = 25
)
## End(Not run)