Create Dummy Variables
Given a column in a Spark DataFrame, generate a new Spark DataFrame containing dummy variable columns.
ml_create_dummy_variables(x, input, reference = NULL, levels = NULL,
  labels = NULL, envir = new.env(parent = emptyenv()))
Arguments
| x | An object coercible to a Spark DataFrame (typically, a
|
| input | The name of the input column. |
| reference | The reference label. This variable is omitted when
generating dummy variables (to avoid perfect multi-collinearity if
all dummy variables were to be used in the model fit); to generate
dummy variables for all columns this can be explicitly set as |
| levels | The set of levels for which dummy variables should be generated.
By default, constructs one variable for each unique value occurring in
the column specified by |
| labels | An optional R list, mapping values in the |
| envir | An optional R environment; when provided, it will be filled with useful auxiliary information. See Auxiliary Information for more information. |
Details
The dummy variables are generated in a manner similar to
model.matrix, where each categorical variable is expanded into a
set of binary (dummy) variables. These dummy variables can then be used
as predictors for categorical data in the various regression routines
provided by sparklyr.
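For comparison, the base R behaviour referred to above can be seen without a Spark connection. The sketch below uses only model.matrix on a small, made-up data frame (the `color` column and its values are illustrative assumptions); it shows the local analogue of the expansion, not what sparklyr runs on the cluster.

```r
# Local illustration only: base R, no Spark involved.
# The data frame and its `color` column are made up for this example.
df <- data.frame(color = factor(c("red", "green", "blue", "green")))

# With an intercept, the first factor level ("blue") acts as the reference
# and gets no column of its own, which avoids perfect multicollinearity.
mm <- model.matrix(~ color, data = df)
colnames(mm)

# Dropping the intercept keeps one dummy column per level instead.
mm_all <- model.matrix(~ color - 1, data = df)
colnames(mm_all)
```

The first call plays the role of a non-empty reference label, and the second corresponds to generating a dummy variable for every level.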
Auxiliary Information
The envir argument can be used as a mechanism for returning
optional information. Currently, the following pieces are returned:
| levels | The set of unique values discovered within the input column. |
| columns | The column names generated. |
If the envir argument is supplied, the names of any dummy variables
generated will be included, under the labels key.
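The way an environment can carry auxiliary results back to the caller can be sketched in plain R. The helper `make_dummies` below is hypothetical and is not sparklyr's implementation; the `is_` column prefix is likewise an assumption made for illustration. It only demonstrates the pattern: because environments are mutable, values assigned inside the function remain visible to the caller afterwards.

```r
# Hypothetical stand-in for illustration only; NOT sparklyr's implementation.
make_dummies <- function(values, envir = new.env(parent = emptyenv())) {
  lv <- sort(unique(values))
  # One 0/1 column per level; the "is_" naming scheme is an assumption.
  cols <- lapply(lv, function(l) as.numeric(values == l))
  names(cols) <- paste0("is_", lv)
  out <- as.data.frame(cols)
  # Assignments into the supplied environment are visible to the caller.
  envir$levels <- lv
  envir$columns <- names(out)
  out
}

aux <- new.env(parent = emptyenv())
dummies <- make_dummies(c("b", "a", "b"), envir = aux)
aux$levels   # unique values found in the input
aux$columns  # names of the generated columns
```

When no environment is passed, the default fresh environment is discarded on return, so the auxiliary information is simply not kept.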