[SPARK-50849][Connect] Add example project to demonstrate Spark Connect Server Libraries #49604
base: master
Conversation
What does this do?
Answered in #49604 (comment)
...-examples/server-library-example/server/src/main/scala/org/example/CustomCommandPlugin.scala
 */
def flush(): Unit = {
  // Write dataset to disk as a CSV file
  DariaWriters.writeSingleFile(
How is this better than using dt.write?
The regular write operation creates a folder and files in the format mydata.csv/part-00000. For simplicity, I figured writing a single '.csv' would work best (but it comes at the cost of requiring the spark-daria jar here).
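For context, a rough sketch of the two write paths being compared. It assumes spark-daria's `DariaWriters.writeSingleFile` helper with its documented `df`/`format`/`sc`/`tmpFolder`/`filename` parameters; the output paths are illustrative and not taken from the example project.

```scala
import com.github.mrpowers.spark.daria.sql.DariaWriters
import org.apache.spark.sql.{DataFrame, SparkSession}

object SingleFileCsvSketch {
  // Plain Dataset.write: creates a directory such as .../mydata.csv/part-00000-*.csv
  def writeWithDataset(df: DataFrame): Unit =
    df.write.option("header", "true").csv("out/partitioned/mydata.csv")

  // spark-daria: collapses the output into one plainly named CSV file
  def writeWithDaria(spark: SparkSession, df: DataFrame): Unit =
    DariaWriters.writeSingleFile(
      df = df,
      format = "csv",
      sc = spark.sparkContext,
      tmpFolder = "/tmp/single-file-staging", // scratch directory, illustrative
      filename = "out/single/mydata.csv"
    )
}
```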
import org.example.{CustomPluginBase, CustomTable}
import org.example.proto

class CustomRelationPlugin extends RelationPlugin with CustomPluginBase {
Add a bit of doc as to why this is needed.
Added!
import scala.collection.JavaConverters._

class CustomCommandPlugin extends CommandPlugin with CustomPluginBase {
Add a bit of doc here.
Added!
Looks good!
A couple of comments.
Can you make sure this is actually built as part of CI?
Thanks for the review @hvanhovell!
Do you mean to have CI compile this project? I explicitly kept this unlinked from the parent Spark POM as it is meant to be 'standalone'.
@@ -0,0 +1,133 @@
# Spark Server Library Example - Custom Datasource Handler
Spark Server -> Spark Connect Server?
 * limitations under the License.
 */

package org.example
It's a little weird to have `org.example` in an ASF repository. Technically, we always use the `org.apache.` prefix.
$ git grep '^package org' | grep -v org.apache | awk '{print $NF}' | sort | uniq -c
import org.apache.spark.connect.proto.Command
import org.example.proto
import org.example.proto.CreateTable.Column.{DataType => ProtoDataType}
import org.apache.spark.sql.{functions, Column, DataFrame, Dataset, Row, SparkSession}
Apache Spark has an ordering rule for `import` statements.
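For illustration, one possible regrouping of the imports quoted above, assuming the usual Spark convention of grouped imports (each group alphabetized and separated by a blank line, with project-local imports in their own group). Whether `org.example` should sort before or after the `org.apache.spark` group in this standalone example is a judgment call.

```scala
import org.apache.spark.connect.proto.Command
import org.apache.spark.sql.{functions, Column, DataFrame, Dataset, Row, SparkSession}

import org.example.proto
import org.example.proto.CreateTable.Column.{DataType => ProtoDataType}
```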
import com.google.protobuf.Any
import org.apache.spark.connect.proto.Command

object Main {
Shall we have some meaningful name instead of `org.example.Main`?
<parent>
  <groupId>org.example</groupId>
  <artifactId>spark-server-library-example</artifactId>
  <version>1.0-SNAPSHOT</version>
?
2,Jane Smith
3,Bob Johnson
4,Alice Williams
5,Charlie Brown
Shall we simply use `data.csv` as the file name, because this seems to be technically CSV format?
@@ -139,3 +139,4 @@ core/src/main/resources/org/apache/spark/ui/static/package.json
testCommitLog
.*\.har
.nojekyll
dummy_data.custom
FYI, if you use a `.csv` or `.data` extension for this new file, we don't need to touch this file.
This looks like an integration test suite to me, but I understand it will be used as documentation. I left a few comments first, mainly from the ASF code perspective.
What changes were proposed in this pull request?
This PR adds a sample project, `server-library-example` (under a new directory, `connect-examples`), to demonstrate the workings of using Spark Connect Server Libraries (see #48922 for context). The sample project contains several modules (`common`, `server` and `client`) to showcase how a user may choose to extend the Spark Connect protocol with custom functionality.

Why are the changes needed?
Currently, there are limited resources and documentation to aid a user in building their own Spark Connect Server Libraries. This PR aims to bridge this gap by providing an exoskeleton of a project to work with.
Does this PR introduce any user-facing change?
No
How was this patch tested?
N/A
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Copilot
-------------------- Render of README.md below ----------------

Spark Server Library Example - Custom Datasource Handler
This example demonstrates a modular maven-based project architecture with separate client, server
and common components. It leverages the extensibility of Spark Connect to create a server library
that may be attached to the server to extend the functionality of the Spark Connect server as a whole. Below is a detailed overview of the setup and functionality.
Project Structure
Functionality Overview
To demonstrate the extensibility of Spark Connect, a custom datasource handler, `CustomTable`, is implemented in the server module. The class handles reading, writing and processing data stored in a custom format; here we simply use the `.custom` extension (which itself is a wrapper over `.csv` files).
First and foremost, the client and the server must be able to communicate with each other through custom messages that 'understand' our custom data format. This is achieved by defining custom protobuf messages in the `common` module. The client and server modules both depend on the `common` module to access these messages.
- `common/src/main/protobuf/base.proto`: Defines the base `CustomTable`, which is simply represented by a path and a name.
- `common/src/main/protobuf/commands.proto`: Defines the custom commands that the client can send to the server. These commands are typically operations that the server can perform, such as cloning an existing custom table.
- `common/src/main/protobuf/relations.proto`: Defines custom `relations`, which are a mechanism through which an optional input dataset is transformed into an output dataset, such as a Scan.
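As a small illustration of what depending on the `common` module buys the client and server, the sketch below constructs a `CustomTable` message through the protobuf-generated builder API. The `setPath`/`setName` accessors are assumptions based on the description above (a path and a name); the actual field names live in `base.proto`.

```scala
import org.example.proto

// Minimal sketch: build the custom protobuf message that both the client and
// the server understand, using the classes generated from base.proto.
object ProtoMessageSketch {
  def customTable(path: String, name: String): proto.CustomTable =
    proto.CustomTable.newBuilder()
      .setPath(path) // assumed field, per "represented by a path and a name"
      .setName(name) // assumed field
      .build()
}
```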
On the client side, the `CustomTable` class mimics the style of Spark's `Dataset` API, allowing the user to perform and chain operations on a `CustomTable` object.
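To make the chaining idea concrete, here is a purely hypothetical client-side snippet; the class and method names are illustrative and are not the ones used in the example project.

```scala
// Hypothetical fluent client API in the spirit of Dataset: each call returns a
// new handle so operations can be chained, while commands go to the server.
class CustomTableHandle(val path: String, val name: String) {
  def cloneTo(targetPath: String): CustomTableHandle = {
    // In the real example this would send a custom "clone" command via Spark Connect.
    new CustomTableHandle(targetPath, name)
  }

  def describe(): Unit =
    println(s"CustomTable(name=$name, path=$path)")
}

object ClientChainingSketch {
  def main(args: Array[String]): Unit =
    new CustomTableHandle("data/dummy_data.custom", "people")
      .cloneTo("data/cloned_data.custom")
      .describe()
}
```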
On the server side, a similar `CustomTable` class is implemented to handle the core functionality of reading, writing and processing data in the custom format. The plugins (`CustomCommandPlugin` and `CustomRelationPlugin`) are responsible for processing the custom protobuf messages sent from the client (those defined in the `common` module) and delegating the appropriate actions to the `CustomTable`.
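The dispatch logic inside such a plugin boils down to checking whether an incoming message is one of ours, unpacking it, and delegating. The sketch below shows only that idea: the `proto.CloneTable` message name is an assumption based on the "cloning" command described earlier, and the real classes implement Spark Connect's `CommandPlugin`/`RelationPlugin` interfaces rather than a free-standing helper.

```scala
import com.google.protobuf.Any

import org.apache.spark.sql.SparkSession

import org.example.proto

// Illustrative delegation only: unpack the custom protobuf command and hand it
// to the server-side table logic. Returning false lets other handlers take over.
object CommandDispatchSketch {
  def handle(raw: Any, spark: SparkSession): Boolean = {
    if (!raw.is(classOf[proto.CloneTable])) {
      false // not one of our messages
    } else {
      val cmd = raw.unpack(classOf[proto.CloneTable]) // CloneTable is an assumed message name
      // A real implementation would read the source .custom file and write the
      // clone, e.g. via the server-side CustomTable class.
      println(s"Would clone a custom table using $cmd with session $spark")
      true
    }
  }
}
```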
Build and Run Instructions
Navigate to the sample project from `SPARK_HOME`:
cd connect-examples/server-library-example

Build and package the modules:

Download the `4.0.0-preview2` release to use as the Spark Connect Server:
curl -L https://archive.apache.org/dist/spark/spark-4.0.0-preview2/spark-4.0.0-preview2-bin-hadoop3.tgz | tar xz
Copy relevant JARs to the root of the unpacked Spark distribution:
Start the Spark Connect Server with the relevant JARs:
In a different terminal, navigate back to the root of the sample project and start the client:
Notice the printed output in the client terminal as well as the creation of the cloned table: