Sparklyr 0.7 (UNRELEASED)
- Added support for the `context` parameter in `spark_apply()` to allow callers to pass additional contextual information to the `f()` closure (see the sketch after this list).
- Implemented a workaround to support `mode = 'append'` in `spark_write_table()`.
- Various ML improvements, including support for pipelines, additional algorithms, hyper-parameter tuning, and better model persistence.
- Added `spark_read_libsvm()` for reading libsvm files.
- Added support for separating struct columns in `sdf_separate_column()`.
- Fixed collection of `short`, `float` and `byte` to properly return NAs.
- Added the `sparklyr.collect.datechars` option to enable collecting `DateType` and `TimestampType` as characters, to support compatibility with previous versions.
- Fixed collection of `DateType` and `TimestampType` from `character` to proper `Date` and `POSIXct` types.
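A minimal sketch of the new `context` argument, assuming a local connection; the `multiplier` value and result columns are illustrative:

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# `context` is serialized once and passed to the closure as its second argument
spark_apply(
  sdf_len(sc, 5),
  function(df, context) data.frame(id = df$id, scaled = df$id * context$multiplier),
  context = list(multiplier = 10)
)
```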
Sparklyr 0.6.4
- Added support for HTTPS for `yarn-cluster`, which is activated by setting `yarn.http.policy` to `HTTPS_ONLY` in `yarn-site.xml`.
- Added support for `sparklyr.yarn.cluster.accepted.timeout` under `yarn-cluster` to allow users to wait for resources on clusters with long waiting times.
- Fix to `spark_apply()` when the package distribution deadlock triggers in environments where multiple executors run under the same node.
- Added support in `spark_apply()` for specifying a list of `packages` to distribute to each worker node (see the sketch after this list).
- Added support in `yarn-cluster` for `sparklyr.yarn.cluster.lookup.prefix`, `sparklyr.yarn.cluster.lookup.username` and `sparklyr.yarn.cluster.lookup.byname` to control the new application lookup behavior.
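A hedged sketch of the `packages` argument; the package choice and model are hypothetical, and `sdf` stands for any `tbl_spark` with numeric columns `y` and `x`:

```r
# Packages listed in `packages` are distributed to each worker node before f() runs
spark_apply(
  sdf,
  function(df) broom::tidy(stats::lm(y ~ x, data = df)),
  packages = c("broom")
)
```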
Sparklyr 0.6.3
- Enabled support for Java 9 for clusters configured with Hadoop 2.8. Java 9 is blocked on `master = "local"` unless `options(sparklyr.java9 = TRUE)` is set.
- Fixed issue in `spark_connect()` where using `set.seed()` before connection would cause session ids to be duplicated and connections to be reused.
- Fixed issue in `spark_connect()` blocking the gateway port when the connection was never started to the backend, for instance, when interrupting the R session while connecting.
- Performance improvement when querying field names from tables, impacting tables and `dplyr` queries, most noticeable in `na.omit` with several columns.
- Fix to `spark_apply()` when the closure returns a `data.frame` that contains no rows and has one or more columns.
- Fix to `spark_apply()` while using `tryCatch()` within the closure, and increased the callstack printed to logs when an error triggers within the closure.
- Added support for the `SPARKLYR_LOG_FILE` environment variable to specify the file used for log output.
- Fixed regression for `union_all()` affecting Spark 1.6.x.
- Added support for the `na.omit.cache` option that, when set to `FALSE`, prevents `na.omit` from caching results when rows are dropped.
- Added support in `spark_connect()` for `yarn-cluster` with high-availability enabled.
- Added support for `spark_connect()` with `master = "yarn-cluster"` to query the YARN resource manager API and retrieve the correct container host name.
- Fixed issue in `invoke()` calls while using integer arrays that contain `NA`, which can be commonly experienced while using `spark_apply()`.
- Added `topics.description` under the `ml_lda()` result.
- Added support for `ft_stop_words_remover()` to strip out stop words from tokens.
- Feature transformers (`ft_*` functions) now explicitly require `input.col` and `output.col` to be specified.
- Added support for `spark_apply_log()` to enable logging in worker nodes while using `spark_apply()`.
- Fix to `spark_apply()` for a `SparkUncaughtExceptionHandler` exception while running over large jobs that may overlap during a, now unnecessary, unregister operation.
- Fixed a race condition the first time `spark_apply()` is run when more than one partition runs in a worker and both processes try to unpack the packages bundle at the same time.
- `spark_apply()` now adds generic column names when needed and validates that `f` is a `function`.
- Improved documentation and error cases for the `metric` argument in `ml_classification_eval()` and `ml_binary_classification_eval()`.
- Fix to `spark_install()` to use the `/logs` subfolder to store local `log4j` logs.
- Fix to `spark_apply()` when R is used from a worker node, since the worker node already contains packages but might still be triggering a different R session.
- Fixed the connection from closing when `invoke()` attempts to use a class with a method that contains a reference to an undefined class.
- Implemented all tuning options from Spark ML for `ml_random_forest()`, `ml_gradient_boosted_trees()`, and `ml_decision_tree()`.
- Avoid tasks failing under `spark_apply()` and multiple concurrent partitions running while selecting the backend port.
- Added support for numeric arguments for `n` in `lead()` for dplyr.
- Added an unsupported error message to `sample_n()` and `sample_frac()` when Spark is not 2.0 or higher.
- Fixed `SIGPIPE` error under `spark_connect()` immediately after a `spark_disconnect()` operation.
- Added support for `sparklyr.apply.env.` under `spark_config()` to allow `spark_apply()` to initialize environment variables (see the sketch after this list).
- Added support for `spark_read_text()` and `spark_write_text()` to read from and write to plain text files.
- Added support for RStudio project templates to create an "R Package using sparklyr".
- Fix `compute()` to trigger a refresh of the connections view.
- Added a `k` argument to `ml_pca()` to enable specification of the number of principal components to extract. Also implemented `sdf_project()` to project datasets using the results of `ml_pca()` models.
- Added support for additional Livy session creation parameters using the `livy_config()` function.
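A hedged sketch of the `sparklyr.apply.env.` prefix mentioned above, assuming that the suffix becomes the environment variable name visible to the worker R sessions; the variable name, value and master are hypothetical:

```r
config <- spark_config()
# Entries prefixed with sparklyr.apply.env. are exported to the R sessions
# that spark_apply() launches on the workers (name and value are hypothetical)
config$sparklyr.apply.env.DATA_DIR <- "/tmp/worker-data"

sc <- spark_connect(master = "local", config = config)

spark_apply(sdf_len(sc, 3), function(df) data.frame(dir = Sys.getenv("DATA_DIR")))
```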
Sparklyr 0.6.2
- Fix `connection_spark_shinyapp()` under RStudio 1.1 to avoid an error while listing Spark installation options for the first time.
Sparklyr 0.6.1
- Fixed error in `spark_apply()` that may be triggered when multiple CPUs are used in a single node, due to race conditions while accessing the gateway service and another in the `JVMObjectTracker`.
- `spark_apply()` now supports explicit column types using the `columns` argument to avoid sampling types (see the sketch after this list).
- `spark_apply()` with `group_by` no longer requires persisting to disk nor memory.
- Added support for Spark 1.6.3 under `spark_install()`.
- `spark_apply()` now logs the current callstack when it fails.
- Fixed error triggered while processing empty partitions in `spark_apply()`.
- Fixed slow printing issue caused by `print` calculating the total row count, which is expensive for some tables.
- Fixed `sparklyr 0.6` issue blocking concurrent `sparklyr` connections, which required setting `config$sparklyr.gateway.remote = FALSE` as a workaround.
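A small sketch of the explicit `columns` types, assuming an existing connection `sc` and that `columns` accepts a named vector of type names; the result columns are illustrative:

```r
# Declaring result column types up front so spark_apply() does not sample the data
spark_apply(
  sdf_len(sc, 10),
  function(df) data.frame(id = df$id, label = paste0("row-", df$id)),
  columns = c(id = "integer", label = "character")
)
```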
Sparklyr 0.6.0
Distributed R
- Added a `packages` parameter to `spark_apply()` to distribute packages across worker nodes automatically.
- Added `sparklyr.closures.rlang` as a `spark_config()` value to support generic closures provided by the `rlang` package.
- Added config options `sparklyr.worker.gateway.address` and `sparklyr.worker.gateway.port` to configure the gateway used under worker nodes.
- Added a `group_by` parameter to `spark_apply()` to support operations over groups of dataframes (see the sketch after this list).
- Added `spark_apply()`, allowing users to use R code to directly manipulate and transform Spark DataFrames.
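A short sketch of `spark_apply()` with `group_by`, assuming an existing connection `sc`; note that `copy_to()` replaces dots in column names with underscores, hence `Petal_Width`:

```r
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# The closure runs once per Species group and returns one row per group
spark_apply(
  iris_tbl,
  function(df) data.frame(n = nrow(df), mean_width = mean(df$Petal_Width)),
  group_by = "Species"
)
```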
External Data
- Added `spark_write_source()`. This function writes data into a Spark data source which can be loaded through a Spark package.
- Added `spark_write_jdbc()`. This function writes from a Spark DataFrame into a JDBC connection.
- Added a `columns` parameter to `spark_read_*()` functions to load data with named columns or explicit column types (see the sketch after this list).
- Added a `partition_by` parameter to `spark_write_csv()`, `spark_write_json()`, `spark_write_table()` and `spark_write_parquet()`.
- Added `spark_read_source()`. This function reads data from a Spark data source which can be loaded through a Spark package.
- Added support for `mode = "overwrite"` and `mode = "append"` to `spark_write_csv()`.
- `spark_write_table()` now supports saving to the default Hive path.
- Improved performance of `spark_read_csv()` reading remote data when `infer_schema = FALSE`.
- Added `spark_read_jdbc()`. This function reads from a JDBC connection into a Spark DataFrame.
- Renamed `spark_load_table()` and `spark_save_table()` to `spark_read_table()` and `spark_write_table()` for consistency with existing `spark_read_*()` and `spark_write_*()` functions.
- Added support to specify a vector of column names in `spark_read_csv()` to specify column names without having to set the type of each column.
- Improved `copy_to()`, `sdf_copy_to()` and `dbWriteTable()` performance under `yarn-client` mode.
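A hedged sketch of reading a CSV with explicit column types instead of schema inference; the path and column names are hypothetical and `sc` is an existing connection:

```r
flights_tbl <- spark_read_csv(
  sc,
  name = "flights",
  path = "hdfs:///data/flights.csv",   # hypothetical location
  infer_schema = FALSE,
  columns = list(year = "integer", carrier = "character", dep_delay = "double")
)
```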
dplyr
- Support for `cumprod()` to calculate cumulative products.
- Support for `cor()`, `cov()`, `sd()` and `var()` as window functions (see the sketch after this list).
- Support for Hive built-in operators `%like%`, `%rlike%`, and `%regexp%` for matching regular expressions in `filter()` and `mutate()`.
- Support for dplyr (>= 0.6) which, among many improvements, increases performance in some queries by making use of a new query optimizer.
- `sample_frac()` takes a fraction instead of a percent to match dplyr.
- Improved performance of `sample_n()` and `sample_frac()` through the use of `TABLESAMPLE` in the generated query.
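A short sketch combining a Hive regex operator with a window aggregate; `flights_tbl` and its columns are hypothetical:

```r
library(dplyr)

flights_tbl %>%
  filter(carrier %rlike% "^U") %>%        # Hive regex match on the carrier code
  group_by(carrier) %>%
  mutate(dep_delay_sd = sd(dep_delay))     # sd() evaluated as a window function
```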
Databases
- Added `src_databases()`. This function lists all the available databases.
- Added `tbl_change_db()`. This function changes the current database (see the sketch after this list).
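A minimal usage sketch, assuming an existing connection `sc`; the database name is hypothetical:

```r
src_databases(sc)               # list the databases available to this session
tbl_change_db(sc, "analytics")  # switch the current database
```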
DataFrames
- Added `sdf_len()`, `sdf_seq()` and `sdf_along()` to help generate numeric sequences as Spark DataFrames.
- Added `spark_set_checkpoint_dir()`, `spark_get_checkpoint_dir()`, and `sdf_checkpoint()` to enable checkpointing.
- Added `sdf_broadcast()` which can be used to hint the query optimizer to perform a broadcast join in cases where a shuffle hash join is planned but not optimal.
- Added `sdf_repartition()`, `sdf_coalesce()`, and `sdf_num_partitions()` to support repartitioning and getting the number of partitions of Spark DataFrames (see the sketch after this list).
- Added `sdf_bind_rows()` and `sdf_bind_cols()` – these functions are the `sparklyr` equivalent of `dplyr::bind_rows()` and `dplyr::bind_cols()`.
- Added `sdf_separate_column()` – this function allows one to separate components of an array / vector column into separate scalar-valued columns.
- `sdf_with_sequential_id()` now supports a `from` parameter to choose the starting value of the id column.
- Added `sdf_pivot()`. This function provides a mechanism for constructing pivot tables, using Spark’s `groupBy` + `pivot` functionality, with a formula interface similar to that of `reshape2::dcast()`.
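A small sketch of the new partitioning helpers, assuming an existing connection `sc`:

```r
# Generate a sequence, repartition it, and inspect the resulting partition count
ids <- sdf_len(sc, 1000)
ids <- sdf_repartition(ids, partitions = 8)
sdf_num_partitions(ids)   # should report 8
```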
MLlib
- Added `vocabulary.only` to `ft_count_vectorizer()` to retrieve the vocabulary with ease.
- GLM type models now support `weights.column` to specify weights in model fitting. (#217)
- `ml_logistic_regression()` now supports multinomial regression, in addition to binomial regression [requires Spark 2.1.0 or greater]. (#748)
- Implemented `residuals()` and `sdf_residuals()` for Spark linear regression and GLM models. The former returns an R vector while the latter returns a `tbl_spark` of training data with a `residuals` column added (see the sketch after this list).
- Added `ml_model_data()`, used for extracting data associated with Spark ML models.
- The `ml_save()` and `ml_load()` functions gain a `meta` argument, allowing users to specify where R-level model metadata should be saved independently of the Spark model itself. This should help facilitate the saving and loading of Spark models used in non-local connection scenarios.
- `ml_als_factorization()` now supports the implicit matrix factorization and nonnegative least squares options.
- Added `ft_count_vectorizer()`. This function can be used to transform columns of a Spark DataFrame so that they might be used as input to `ml_lda()`. This should make it easier to invoke `ml_lda()` on Spark data sets.
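A hedged sketch of the new residuals helpers, using the formula/data calling convention documented elsewhere in this file; `sc` is an existing connection:

```r
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

fit <- ml_linear_regression(mpg ~ wt + cyl, data = mtcars_tbl)

residuals(fit)      # plain R vector of residuals
sdf_residuals(fit)  # tbl_spark of training data with a `residuals` column appended
```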
Broom
- Implemented `tidy()`, `augment()`, and `glance()` from tidyverse/broom for `ml_model_generalized_linear_regression` and `ml_model_linear_regression` models.
R Compatibility
- Implemented `cbind.tbl_spark()`. This method works by first generating index columns using `sdf_with_sequential_id()` and then performing an `inner_join()`. Note that dplyr join functions should still be used for DataFrames with common keys, since they are less expensive.
Connections
- Increased the default number of concurrent connections by setting the default for `spark.port.maxRetries` from 16 to 128.
- Support for gateway connections `sparklyr://hostname:port/session` and using `spark-submit --class sparklyr.Shell sparklyr-2.1-2.11.jar <port> <id> --remote`.
- Added support for `sparklyr.gateway.service` and `sparklyr.gateway.remote` to enable/disable the gateway in service mode and to accept remote connections required for Yarn Cluster mode.
- Added support for Yarn Cluster mode using `master = "yarn-cluster"`. Either explicitly set `config = list(sparklyr.gateway.address = "<driver-name>")` or, implicitly, `sparklyr` will read the `site-config.xml` for the `YARN_CONF_DIR` environment variable (see the sketch after this list).
- Added `spark_context_config()` and `hive_context_config()` to retrieve runtime configurations for the Spark and Hive contexts.
- Added `sparklyr.log.console` to redirect logs to the console, useful for troubleshooting `spark_connect`.
- Added `sparklyr.backend.args` as a config option to enable passing parameters to the `sparklyr` backend.
- Improved logging while establishing connections to `sparklyr`.
- Improved `spark_connect()` performance.
- Implemented new configuration checks to proactively report connection errors on Windows.
- While connecting to Spark from Windows, setting the `sparklyr.verbose` option to `TRUE` prints detailed configuration steps.
- Added `custom_headers` to `livy_config()` to add custom headers to the REST call to the Livy server.
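A minimal sketch of the Yarn Cluster connection described above, with an explicit gateway address; the driver host name is hypothetical:

```r
sc <- spark_connect(
  master = "yarn-cluster",
  config = list(sparklyr.gateway.address = "driver-node.example.com")
)
```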
Compilation
- Added support for `jar_dep` in the compilation specification to support additional `jars` through `spark_compile()`.
- `spark_compile()` now prints deprecation warnings.
- Added `download_scalac()` to assist in downloading all the Scala compilers required to build using `compile_package_jars`, and provided support for using any `scalac` minor version while looking for the right compiler.
Backend
- Improved backend logging by adding a type and session id prefix.
Miscellaneous
- `copy_to()` and `sdf_copy_to()` auto-generate a `name` when an expression can’t be transformed into a table name.
- Implemented `type_sum.jobj()` (from tibble) to enable better printing of jobj objects embedded in data frames.
- Added the `spark_home_set()` function, to help facilitate the setting of the `SPARK_HOME` environment variable. This should prove useful in teaching environments when teaching the basics of Spark and sparklyr (see the sketch after this list).
- Added support for the `sparklyr.ui.connections` option, which adds additional connection options into the new connections dialog. The `rstudio.spark.connections` option is now deprecated.
- Implemented the “New Connection Dialog” as a Shiny application to be able to support newer versions of RStudio that deprecate the current connections UI.
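A short sketch of `spark_home_set()`; the installation path is hypothetical:

```r
# Point SPARK_HOME at an existing Spark installation before connecting
spark_home_set("/opt/spark/spark-2.1.0-bin-hadoop2.7")
sc <- spark_connect(master = "local")
```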
Bug Fixes
- When using `spark_connect()` in local clusters, it validates that `java` exists under `JAVA_HOME` to help troubleshoot systems that have an incorrect `JAVA_HOME`.
- Improved the `argument is of length zero` error triggered while retrieving data with no columns to display.
- Fixed `Path does not exist` exception referencing `hdfs` during `copy_to` under systems configured with `HADOOP_HOME`.
- Fixed session crash after the “No status is returned” error by terminating the invalid connection, and added support to print the log trace during this error.
- `compute()` now caches data in memory by default. To revert this behavior, set `sparklyr.dplyr.compute.nocache` to `TRUE`.
- `spark_connect()` with `master = "local"` and a given `version` overrides `SPARK_HOME` to avoid existing installation mismatches.
- Fixed `spark_connect()` issue under Windows when `newInstance0` is present in the logs.
- Fixed collecting `long` type columns when NAs are present (#463).
- Fixed backend issue affecting systems where `localhost` does not resolve properly to the loopback address.
- Fixed issue collecting data frames containing newlines `\n`.
- Spark Null objects (objects of class NullType) discovered within numeric vectors are now collected as NAs, rather than lists of NAs.
- Fixed warning while connecting with Livy and improved the 401 message.
- Fixed issue in `spark_read_parquet()` and other read methods in which `spark_normalize_path()` would not work on some platforms while loading data using custom protocols like `s3n://` for Amazon S3.
- Resolved issue in `spark_save()` / `load_table()` to support saving / loading data, and added a path parameter in `spark_load_table()` for consistency with other functions.
Sparklyr 0.5.5
- Implemented support for the `connectionViewer` interface required in RStudio 1.1 and `spark_connect` with `mode = "databricks"`.
Sparklyr 0.5.4
- Implemented support for `dplyr 0.6` and Spark 2.1.x.
Sparklyr 0.5.3
- Implemented support for `DBI 0.6`.
Sparklyr 0.5.2
- Fix to `spark_connect` affecting Windows users and Spark 1.6.x.
- Fix to Livy connections, which would cause connections to fail while the connection is in the ‘waiting’ state.
Sparklyr 0.5.0
- Implemented basic authorization for Livy connections using `livy_config_auth()`.
- Added support to specify additional `spark-submit` parameters using the `sparklyr.shell.args` environment variable.
- Renamed `sdf_load()` and `sdf_save()` to `spark_read()` and `spark_write()` for consistency.
- The functions `tbl_cache()` and `tbl_uncache()` can now be used without requiring the `dplyr` namespace to be loaded.
- `spark_read_csv(..., columns = <...>, header = FALSE)` should now work as expected – previously, `sparklyr` would still attempt to normalize the column names provided.
- Support to configure Livy using the `livy.` prefix in the `config.yml` file.
- Implemented experimental support for Livy through `livy_install()`, `livy_service_start()`, `livy_service_stop()` and `spark_connect(method = "livy")` (see the sketch after this list).
- The `ml` routines now accept `data` as an optional argument, to support calls of the form e.g. `ml_linear_regression(y ~ x, data = data)`. This should be especially helpful in conjunction with `dplyr::do()`.
- Spark `DenseVector` and `SparseVector` objects are now deserialized as R numeric vectors, rather than Spark objects. This should make it easier to work with the output produced by `sdf_predict()` with Random Forest models, for example.
- Implemented `dim.tbl_spark()`. This should ensure that `dim()`, `nrow()` and `ncol()` all produce the expected result with `tbl_spark`s.
- Improved Spark 2.0 installation in Windows by creating `spark-defaults.conf` and configuring `spark.sql.warehouse.dir`.
- Embedded Apache Spark package dependencies to avoid requiring internet connectivity while connecting for the first time through `spark_connect`. The `sparklyr.csv.embedded` config setting was added to configure a regular expression to match Spark versions where the embedded package is deployed.
- Increased exception callstack and message length to include full error details when an exception is thrown in Spark.
- Improved validation of supported Java versions.
- The `spark_read_csv()` function now accepts the `infer_schema` parameter, controlling whether the columns schema should be inferred from the underlying file itself. Disabling this should improve performance when the schema is known beforehand.
- Added a `do_.tbl_spark` implementation, allowing for the execution of `dplyr::do` statements on Spark DataFrames. Currently, the computation is performed in serial across the different groups specified on the Spark DataFrame; in the future we hope to explore a parallel implementation. Note that `do_` always returns a `tbl_df` rather than a `tbl_spark`, as the objects produced within a `do_` query may not necessarily be Spark objects.
- Improved errors, warnings and fallbacks for unsupported Spark versions.
- `sparklyr` now defaults to `tar = "internal"` in its calls to `untar()`. This should help resolve issues some Windows users have seen related to an inability to connect to Spark, which ultimately were caused by a lack of permissions on the Spark installation.
- Resolved an issue where `copy_to()` and other R => Spark data transfer functions could fail when the last column contained missing / empty values. (#265)
- Added `sdf_persist()` as a wrapper to the Spark DataFrame `persist()` API.
- Resolved an issue where `predict()` could produce results in the wrong order for large Spark DataFrames.
- Implemented support for `na.action` with the various Spark ML routines. The value of `getOption("na.action")` is used by default. Users can customize the `na.action` argument through the `ml.options` object accepted by all ML routines.
- On Windows, long paths, and paths containing spaces, are now supported within calls to `spark_connect()`.
- The `lag()` window function now accepts numeric values for `n`. Previously, only integer values were accepted. (#249)
- Added support to configure Spark environment variables using the `spark.env.*` config.
- Added support for the `Tokenizer` and `RegexTokenizer` feature transformers. These are exported as the `ft_tokenizer()` and `ft_regex_tokenizer()` functions.
- Resolved an issue where attempting to call `copy_to()` with an R `data.frame` containing many columns could fail with a Java StackOverflow. (#244)
- Resolved an issue where attempting to call `collect()` on a Spark DataFrame containing many columns could produce the wrong result. (#242)
- Added support to parameterize network timeouts using the `sparklyr.backend.timeout`, `sparklyr.gateway.start.timeout` and `sparklyr.gateway.connect.timeout` config settings.
- Improved logging while establishing connections to `sparklyr`.
- Added `sparklyr.gateway.port` and `sparklyr.gateway.address` as config settings.
- The `spark_log()` function now accepts the `filter` parameter. This can be used to filter entries within the Spark log.
- Increased network timeout for `sparklyr.backend.timeout`.
- Moved the `spark.jars.default` setting from options to Spark config.
- `sparklyr` now properly respects the Hive metastore directory with the `sdf_save_table()` and `sdf_load_table()` APIs for Spark < 2.0.0.
- Added `sdf_quantile()` as a means of computing (approximate) quantiles for a column of a Spark DataFrame.
- Added support for `n_distinct(...)` within the `dplyr` interface, based on a call to the Hive function `count(DISTINCT ...)`. (#220)
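A hedged sketch of the experimental Livy workflow, assuming a local Livy service on its default port:

```r
library(sparklyr)

livy_install()         # install Livy locally
livy_service_start()   # start the local Livy service

sc <- spark_connect(master = "http://localhost:8998", method = "livy")

# ... use sc as with any other sparklyr connection ...

spark_disconnect(sc)
livy_service_stop()
```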
Sparklyr 0.4.0
- First release to CRAN.