Spark ML -- Latent Dirichlet Allocation
Fit a Latent Dirichlet Allocation (LDA) model to a Spark DataFrame.
ml_lda(x, features = tbl_vars(x), k = length(features), alpha = (50 / k) + 1,
  beta = 0.1 + 1, optimizer = "online", max.iterations = 20,
  ml.options = ml_options(), ...)
Arguments
| x | An object coercible to a Spark DataFrame (typically, a `tbl_spark`). |
| features | The name of features (terms) to use for the model fit. |
| k | The number of topics to estimate. |
| alpha | Concentration parameter for the prior placed on documents' distributions over topics. This is a singleton value which is replicated to a vector of length k. |
| beta | Concentration parameter for the prior placed on topics' distributions over terms. For the Expectation-Maximization optimizer the value should be > 1.0; the default is `0.1 + 1`. |
| optimizer | The optimizer to use, either `"online"` (Online Variational Bayes) or `"em"` (Expectation-Maximization). |
| max.iterations | Maximum number of iterations. |
| ml.options | Optional arguments, used to affect the model generated. See `ml_options` for more details. |
| ... | Optional arguments. |
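As a quick sanity check of the default priors, the snippet below (plain R, no Spark connection needed) evaluates the default expressions from the usage above for an assumed `k = 4`:

```r
# Default concentration parameters for k = 4 topics,
# mirroring the defaults in the usage signature above.
k <- 4
alpha <- (50 / k) + 1   # document-topic prior: 13.5 for k = 4
beta  <- 0.1 + 1        # topic-term prior: 1.1 (> 1.0, as the EM optimizer requires)
c(alpha = alpha, beta = beta)
```

Note that `alpha` shrinks toward 1 as `k` grows, while `beta` is a fixed default independent of `k`.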
Note
The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
For the terminology used in the LDA model, see the Spark LDA documentation.
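In that terminology, the standard LDA generative process can be summarized as follows (a sketch of the model from Blei et al., 2003, using the "phi" naming for topic-term distributions):

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta)
  && \text{topic } k\text{'s distribution over terms} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{document } d\text{'s distribution over topics} \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d)
  && \text{topic assignment for word } n \text{ in document } d \\
w_{d,n} &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}})
  && \text{observed word drawn from the assigned topic}
\end{align*}
```

The `alpha` and `beta` arguments above are the concentration parameters of the two Dirichlet priors in this process.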
References
Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Expectation-Maximization: Asuncion et al. "On Smoothing and Inference for Topic Models." Uncertainty in Artificial Intelligence, 2009.
See also
Other Spark ML routines: ml_als_factorization,
ml_decision_tree,
ml_generalized_linear_regression,
ml_gradient_boosted_trees,
ml_kmeans,
ml_linear_regression,
ml_logistic_regression,
ml_multilayer_perceptron,
ml_naive_bayes,
ml_one_vs_rest, ml_pca,
ml_random_forest,
ml_survival_regression
Examples
# NOT RUN {
library(janeaustenr)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy the Jane Austen corpus into Spark
austen_books <- austen_books()
books_tbl <- sdf_copy_to(sc, austen_books, overwrite = TRUE)

# Keep the first 100 non-empty lines of text
first_tbl <- books_tbl %>%
  filter(nchar(text) > 0) %>%
  head(100)

# Tokenize, build term counts, and fit a 4-topic LDA model
first_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features") %>%
  ml_lda("features", k = 4)
# }