Spark ML -- Latent Dirichlet Allocation
Fit a Latent Dirichlet Allocation (LDA) model to a Spark DataFrame.
ml_lda(x, features = tbl_vars(x), k = length(features), alpha = (50 / k) + 1,
  beta = 0.1 + 1, optimizer = "online", max.iterations = 20,
  ml.options = ml_options(), ...)
Arguments
| x | An object coercible to a Spark DataFrame (typically, a `tbl_spark`). |
| features | The name of features (terms) to use for the model fit. |
| k | The number of topics to estimate. |
| alpha | Concentration parameter for the prior placed on documents' distributions over topics. This is a singleton value which is replicated to a vector of length k. |
| beta | Concentration parameter for the prior placed on topics' distributions over terms. For the Expectation-Maximization optimizer the value should be > 1.0; the default is `0.1 + 1`. |
| optimizer | The optimizer to use, either `"online"` (Online Variational Bayes) or `"em"` (Expectation-Maximization). |
| max.iterations | Maximum number of iterations. |
| ml.options | Optional arguments, used to affect the model generated. See `ml_options` for more details. |
| ... | Optional arguments. |
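As a quick sanity check of the default priors, the snippet below (plain R, no Spark connection needed) evaluates the default expressions from the usage above for an assumed `k = 4`:

```r
# Default concentration parameters for k = 4 topics,
# mirroring the defaults in the usage signature above.
k <- 4
alpha <- (50 / k) + 1   # document-topic prior: 13.5 for k = 4
beta  <- 0.1 + 1        # topic-term prior: 1.1 (> 1.0, as the EM optimizer requires)
c(alpha = alpha, beta = beta)
```

Note that `alpha` shrinks toward 1 as `k` grows, while `beta` is a fixed default independent of `k`.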
Note
The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
For the terminology used in the LDA model, see the Spark LDA documentation.
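In that terminology, the standard LDA generative process can be summarized as follows (a sketch of the model from Blei et al., 2003, using the "phi" naming for topic-term distributions):

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta)
  && \text{topic } k\text{'s distribution over terms} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{document } d\text{'s distribution over topics} \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d)
  && \text{topic assignment for word } n \text{ in document } d \\
w_{d,n} &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}})
  && \text{observed word drawn from the assigned topic}
\end{align*}
```

The `alpha` and `beta` arguments above are the concentration parameters of the two Dirichlet priors in this process.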
References
Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Expectation-Maximization: Asuncion et al. "On Smoothing and Inference for Topic Models." Uncertainty in Artificial Intelligence, 2009.
See also
Other Spark ML routines: ml_als_factorization,
ml_decision_tree,
ml_generalized_linear_regression,
ml_gradient_boosted_trees,
ml_kmeans,
ml_linear_regression,
ml_logistic_regression,
ml_multilayer_perceptron,
ml_naive_bayes,
ml_one_vs_rest, ml_pca,
ml_random_forest,
ml_survival_regression
Examples
# NOT RUN {
library(janeaustenr)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy the Jane Austen corpus into Spark
austen_books <- austen_books()
books_tbl <- sdf_copy_to(sc, austen_books, overwrite = TRUE)

# Keep the first 100 non-empty lines of text
first_tbl <- books_tbl %>%
  filter(nchar(text) > 0) %>%
  head(100)

# Tokenize, build term counts, and fit a 4-topic LDA model
first_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features") %>%
  ml_lda("features", k = 4)
# }