Prepare a Spark DataFrame for Spark ML Routines
This routine prepares a Spark DataFrame for use by Spark ML routines.
ml_prepare_dataframe(x, features, response = NULL, ...,
  ml.options = ml_options(), envir = new.env(parent = emptyenv()))
Arguments
| x | An object coercible to a Spark DataFrame (typically, a
|
| features | The name of features (terms) to use for the model fit. |
| response | The name of the response vector (as a length-one character
vector), or a formula, giving a symbolic description of the model to be
fitted. When |
| ... | Optional arguments. The |
| ml.options | Optional arguments, used to affect the model generated. See
|
| envir | An R environment -- when supplied, it will be filled with metadata describing the transformations that have taken place. |
Details
Spark DataFrames are prepared through the following transformations:
All specified columns are transformed into a numeric data type (using a simple cast for integer / logical columns, and ft_string_indexer for strings).
The ft_vector_assembler is used to combine the specified features into a single 'feature' vector, suitable for use with Spark ML routines.
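The two transformations can be illustrated locally, without a Spark connection. The following is a minimal base-R sketch of the idea only, not the Spark implementation: prepare_local and the example columns are hypothetical, the frequency-ordered, zero-based indexing of strings is assumed to mirror the default behavior of Spark's StringIndexer, and a plain numeric matrix stands in for the assembled feature vector.

```r
# Local base-R analogy of the preparation steps; not what sparklyr runs.
# 'prepare_local' and all column names below are hypothetical.
prepare_local <- function(df, features, response = NULL) {
  for (col in c(features, response)) {
    values <- df[[col]]
    if (is.character(values) || is.factor(values)) {
      # Index strings by descending frequency, starting at 0
      # (assumed to mirror the default ordering of Spark's StringIndexer).
      labels <- as.character(values)
      counts <- sort(table(labels), decreasing = TRUE)
      df[[col]] <- as.numeric(match(labels, names(counts)) - 1)
    } else {
      # Simple cast for integer / logical columns.
      df[[col]] <- as.numeric(values)
    }
  }
  # Stand-in for the vector assembler: one numeric row per observation.
  list(data = df, features = as.matrix(df[features]))
}

df <- data.frame(
  wt = c(2L, 3L, 4L, 5L),
  manual = c(TRUE, FALSE, TRUE, TRUE),
  fuel = c("gas", "diesel", "gas", "gas"),
  mpg = c(30.1, 22.4, 25.0, 19.8),
  stringsAsFactors = FALSE
)

prepared <- prepare_local(df, features = c("wt", "manual", "fuel"),
                          response = "mpg")
prepared$features
```

Here every feature column ends up numeric, and the most frequent string ("gas") receives index 0.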
After calling this function, the envir environment (when supplied)
will be populated with a set of variables:
| features | The name of the generated features vector. |
| response | The name of the generated response vector. |
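Passing an environment works as an output channel because R environments have reference semantics: bindings created inside the callee remain visible to the caller. The sketch below shows only that pattern in base R. The function record_names and the two column names it stores are hypothetical placeholders, not the names that ml_prepare_dataframe generates.

```r
# Hypothetical stand-in for a preparation routine: it records the names of
# the generated columns in the environment supplied by the caller.
record_names <- function(envir = new.env(parent = emptyenv())) {
  assign("features", "features_assembled", envir = envir)  # placeholder name
  assign("response", "label_indexed", envir = envir)       # placeholder name
  invisible(envir)
}

envir <- new.env(parent = emptyenv())
record_names(envir)

# The caller's environment now holds both bindings.
sort(ls(envir))
envir$features
envir$response
```

Creating the environment with parent = emptyenv(), as in the default argument above, keeps lookups such as envir$features from falling through to unrelated objects in enclosing scopes.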
Examples
# NOT RUN {
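# example of how 'ml_prepare_dataframe' might be used to invoke
# Spark's LinearRegression routine from the 'ml' package
# (assumes 'sc' is an open Spark connection, 'df' is a Spark DataFrame,
# and 'features' / 'response' hold the relevant column names)
envir <- new.env(parent = emptyenv())
tdf <- ml_prepare_dataframe(df, features, response, envir = envir)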
lr <- invoke_new(
  sc,
  "org.apache.spark.ml.regression.LinearRegression"
)
# use generated 'features', 'response' vector names in model fit
model <- lr %>%
  invoke("setFeaturesCol", envir$features) %>%
  invoke("setLabelCol", envir$response)
# }