Prepare a Spark DataFrame for Spark ML Routines

This routine prepares a Spark DataFrame for use by Spark ML routines.

ml_prepare_dataframe(x, features, response = NULL, ...,
  ml.options = ml_options(), envir = new.env(parent = emptyenv()))

Arguments

x	An object coercible to a Spark DataFrame (typically, a `tbl_spark`).
features	The names of the features (terms) to use for the model fit.
response	The name of the response vector (as a length-one character vector), or a formula giving a symbolic description of the model to be fitted. When `response` is a formula, it takes precedence over the other parameters and is used to set the `response`, `features`, and `intercept` parameters (if available). Currently, only simple linear combinations of existing parameters are supported; e.g. `response ~ feature1 + feature2 + ...`. The intercept term can be omitted by using `- 1` in the formula.
...	Optional arguments. The `data` argument can be used to specify the data to be used when `x` is a formula; this allows calls of the form `ml_linear_regression(y ~ x, data = tbl)`, and is especially useful in conjunction with `do`.
ml.options	Optional arguments, used to affect the model generated. See `ml_options` for more details.
envir	An R environment -- when supplied, it will be filled with metadata describing the transformations that have taken place.
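
To make the formula form of `response` concrete, the sketch below uses base R's `terms()` to split a formula into a response name, feature names, and an intercept flag. It only illustrates the decomposition described above and is not claimed to be how sparklyr parses formulas internally; the column names `mpg`, `wt`, and `cyl` are placeholders.

```r
# A formula with two features and the intercept removed via `- 1`.
f <- mpg ~ wt + cyl - 1

# terms() records the term labels and whether an intercept is present.
tt <- terms(f)

response  <- deparse(f[[2]])             # left-hand side as a string
features  <- attr(tt, "term.labels")     # right-hand side terms
intercept <- attr(tt, "intercept") == 1  # FALSE here because of `- 1`
```

With this formula, `response` is `"mpg"`, `features` is `c("wt", "cyl")`, and `intercept` is `FALSE`.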

Details

Spark DataFrames are prepared through the following transformations:

All specified columns are transformed into a numeric data type (using a simple cast for integer / logical columns, and `ft_string_indexer` for strings).
The `ft_vector_assembler` is used to combine the specified features into a single feature vector, suitable for use with Spark ML routines.
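
The two steps above can be mimicked on a local data frame in base R. This is only an analogy for what happens to each column, not the code sparklyr runs on Spark: the helper `to_numeric` and the sample data are hypothetical, and the string indexing here orders labels by descending frequency with ties broken alphabetically, which is a choice made for this sketch.

```r
# Local data frame standing in for a Spark DataFrame (illustrative data).
df <- data.frame(
  n     = c(1L, 2L, 3L, 4L),
  flag  = c(TRUE, FALSE, TRUE, TRUE),
  group = c("a", "b", "a", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map one column to a numeric vector.
to_numeric <- function(col) {
  if (is.character(col)) {
    # Index labels starting at 0, most frequent label first.
    counts <- table(col)
    labels <- names(counts)[order(-as.integer(counts), names(counts))]
    as.numeric(match(col, labels) - 1L)
  } else {
    as.numeric(col)  # simple cast for integer / logical columns
  }
}

# Step 1: make every column numeric.
numeric_cols <- lapply(df, to_numeric)

# Step 2: assemble the columns so each row is one feature vector.
features <- do.call(cbind, numeric_cols)
```

Each row of `features` then plays the role of the assembled feature vector for one observation.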

After calling this function, the `envir` environment (when supplied) will be populated with the following variables:

`features`:	The name of the generated `features` vector.
`response`:	The name of the generated `response` vector.
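
The reason an environment is used for this metadata is that R environments are modified in place, so the caller sees what the function wrote. The sketch below shows that pattern with a hypothetical stand-in function, `prepare_stub`; the values `"features_vec"` and `"label_col"` are invented for illustration and are not the names sparklyr generates.

```r
# Hypothetical stand-in for a function that reports metadata via `envir`.
prepare_stub <- function(envir = new.env(parent = emptyenv())) {
  assign("features", "features_vec", envir = envir)
  assign("response", "label_col", envir = envir)
  invisible(NULL)
}

# The caller creates the environment and passes it in.
envir <- new.env(parent = emptyenv())
prepare_stub(envir = envir)

# The variables written inside the function are visible here.
ls(envir)
envir$features
envir$response
```

If `envir` is not supplied, the default environment is created inside the call and discarded, so the metadata is not available afterwards.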

Examples

# NOT RUN {
# example of how 'ml_prepare_dataframe' might be used to invoke
# Spark's LinearRegression routine from the 'ml' package
# 'sc', 'df', 'features' and 'response' are assumed to already exist
envir <- new.env(parent = emptyenv())
tdf <- ml_prepare_dataframe(df, features, response, envir = envir)

lr <- invoke_new(
  sc,
  "org.apache.spark.ml.regression.LinearRegression"
)

# use generated 'features', 'response' vector names in model fit
model <- lr %>%
  invoke("setFeaturesCol", envir$features) %>%
  invoke("setLabelCol", envir$response)
# }