
[SPARK-50849][Connect] Add example project to demonstrate Spark Connect Server Libraries #49604

Open · wants to merge 4 commits into master

Conversation

@vicennial (Contributor) commented on Jan 22, 2025

What changes were proposed in this pull request?

This PR adds a sample project, server-library-example (under a new connect-examples directory), to demonstrate how to build and use Spark Connect Server Libraries (see #48922 for context).
The sample project contains several modules (common, server and client) to showcase how a user may choose to extend the Spark Connect protocol with custom functionality.

Why are the changes needed?

Currently, there are limited resources and documentation to aid a user in building their own Spark Connect Server Libraries. This PR aims to bridge this gap by providing a skeleton project to work with.

Does this PR introduce any user-facing change?

No

How was this patch tested?

N/A

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Copilot

-------------------- Render of README.md below ----------------

Spark Server Library Example - Custom Datasource Handler

This example demonstrates a modular Maven-based project architecture with separate client, server
and common components. It leverages the extensibility of Spark Connect to create a server library
that can be attached to a Spark Connect server to extend its functionality. Below is a detailed overview of the setup and functionality.

Project Structure

├── common/                # Shared protobuf/utilities/classes
├── client/                # Sample client implementation 
│   ├── src/               # Source code for client functionality
│   ├── pom.xml            # Maven configuration for the client
├── server/                # Server-side plugin extension
│   ├── src/               # Source code for server functionality
│   ├── pom.xml            # Maven configuration for the server
├── resources/             # Static resources
├── pom.xml                # Parent Maven configuration

Functionality Overview

To demonstrate the extensibility of Spark Connect, a custom datasource handler, CustomTable, is
implemented in the server module. The class handles reading, writing and processing data stored in
a custom format; here we simply use the .custom extension, which is itself a wrapper over .csv
files.

First and foremost, the client and the server must be able to communicate with each other through
custom messages that 'understand' our custom data format. This is achieved by defining custom
protobuf messages in the common module. The client and server modules both depend on the common
module to access these messages.

  • common/src/main/protobuf/base.proto: Defines the base CustomTable which is simply represented
    by a path and a name.
message CustomTable {
  string path = 1;
  string name = 2;
}
  • common/src/main/protobuf/commands.proto: Defines the custom commands that the client can send
    to the server. These commands are typically operations that the server can perform, such as cloning
    an existing custom table.
message CustomCommand {
  oneof command_type {
    CreateTable create_table = 1;
    CloneTable clone_table = 2;
  }
}
  • common/src/main/protobuf/relations.proto: Defines custom relations. A relation is a mechanism through which an
    optional input dataset is transformed into an output dataset, such as a Scan over a CustomTable.
message Scan {
  CustomTable table = 1;
}
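
These messages do not replace the Spark Connect protocol; they ride inside it. Spark Connect's Relation and Command messages expose a generic google.protobuf.Any extension field, which the client populates with the custom messages and the server-side plugins unpack. A minimal client-side packing sketch (the builder calls below are illustrative, not the exact helper code in this example):

import com.google.protobuf.Any
import org.apache.spark.connect.{proto => sparkProto}
import org.example.proto

// Build the custom Scan relation and pack it into the generic `extension`
// field of Spark Connect's Relation message.
val scan = proto.Scan.newBuilder()
  .setTable(proto.CustomTable.newBuilder()
    .setName("sample_table")
    .setPath("resources/dummy_data.custom"))
  .build()

val relation = sparkProto.Relation.newBuilder()
  .setExtension(Any.pack(scan))
  .build()

On the server, the plugins registered via spark.connect.extensions.relation.classes and spark.connect.extensions.command.classes (see the run instructions below) receive this extension and translate it into a logical plan or an executed action.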

On the client side, the CustomTable class mimics the style of Spark's Dataset API, allowing the
user to perform and chain operations on a CustomTable object.
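
For illustration, a client-side program might chain operations roughly as follows (the method names from, explain and clone are hypothetical placeholders for this sketch; the actual entry point lives in org.example.Main in the client module):

// Hypothetical client-side flow, mirroring the output shown in step 7 of the
// run instructions: read a custom table, explain it, clone it, explain the clone.
val table = CustomTable.from("../resources/dummy_data.custom", "sample_table")
table.explain()

val cloned = table.clone("../resources/cloned_data.custom", "cloned_table")
cloned.explain()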

On the server side, a similar CustomTable class is implemented to handle the core functionality of
reading, writing and processing data in the custom format. The plugins (CustomCommandPlugin and
CustomRelationPlugin) are responsible for processing the custom protobuf messages sent from the client
(those defined in the common module) and delegating the appropriate actions to the CustomTable.
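
As a rough sketch of the dispatch inside such a command plugin (only the generated protobuf accessors are standard here; processCreateTable and processCloneTable are hypothetical helpers that would delegate to the server-side CustomTable):

import org.example.proto

// Dispatch on the `oneof command_type` of the custom command message.
def process(command: proto.CustomCommand): Unit = {
  command.getCommandTypeCase match {
    case proto.CustomCommand.CommandTypeCase.CREATE_TABLE =>
      processCreateTable(command.getCreateTable)
    case proto.CustomCommand.CommandTypeCase.CLONE_TABLE =>
      processCloneTable(command.getCloneTable)
    case other =>
      throw new UnsupportedOperationException(s"Unsupported command type: $other")
  }
}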

Build and Run Instructions

  1. Navigate to the sample project from SPARK_HOME:

    cd connect-examples/server-library-example
  2. Build and package the modules:

    mvn clean package
  3. Download and unpack the Spark 4.0.0-preview2 release to use as the Spark Connect Server.

  4. Copy relevant JARs to the root of the unpacked Spark distribution:

     cp \
     <SPARK_HOME>/connect-examples/server-library-example/resources/spark-daria_2.13-1.2.3.jar \
     <SPARK_HOME>/connect-examples/server-library-example/common/target/spark-server-library-example-common-1.0-SNAPSHOT.jar \
     <SPARK_HOME>/connect-examples/server-library-example/server/target/spark-server-library-example-server-extension-1.0-SNAPSHOT.jar \
     .
  5. Start the Spark Connect Server with the relevant JARs:

     bin/spark-connect-shell \
       --jars spark-server-library-example-server-extension-1.0-SNAPSHOT.jar,spark-server-library-example-common-1.0-SNAPSHOT.jar,spark-daria_2.13-1.2.3.jar \
       --conf spark.connect.extensions.relation.classes=org.example.CustomRelationPlugin \
       --conf spark.connect.extensions.command.classes=org.example.CustomCommandPlugin
  6. In a different terminal, navigate back to the root of the sample project and start the client:

    java -cp client/target/spark-server-library-client-package-scala-1.0-SNAPSHOT.jar org.example.Main
  7. Notice the printed output in the client terminal as well as the creation of the cloned table:

Explaining plan for custom table: sample_table with path: <SPARK_HOME>/spark/connect-examples/server-library-example/client/../resources/dummy_data.custom
== Parsed Logical Plan ==
Relation [id#2,name#3] csv
== Analyzed Logical Plan ==
id: int, name: string
Relation [id#2,name#3] csv
== Optimized Logical Plan ==
Relation [id#2,name#3] csv
== Physical Plan ==
FileScan csv [id#2,name#3] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/venkata.gudesa/spark/connect-examples/server-library-example/resou..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string>
Explaining plan for custom table: cloned_table with path: <SPARK_HOME>/connect-examples/server-library-example/client/../resources/cloned_data.custom
== Parsed Logical Plan ==
Relation [id#2,name#3] csv
== Analyzed Logical Plan ==
id: int, name: string
Relation [id#2,name#3] csv
== Optimized Logical Plan ==
Relation [id#2,name#3] csv
== Physical Plan ==
FileScan csv [id#2,name#3] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/venkata.gudesa/spark/connect-examples/server-library-example/resou..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string>

Contributor:

What does this do?

Contributor Author:

Answered in #49604 (comment)

   */
  def flush(): Unit = {
    // Write dataset to disk as a CSV file
    DariaWriters.writeSingleFile(
Contributor:

How is this better than using dt.write?

Contributor Author:

The regular write operation creates a folder and files in the format mydata.csv/part-00000. For simplicity, I figured writing a single '.csv' would work best (but comes at the cost of requiring the spark-daria jar here)
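
To make the trade-off concrete, a sketch of the two approaches (the DariaWriters call follows the spark-daria documentation and its argument names may differ slightly between versions; df is the Dataset being flushed and spark the active session):

import com.github.mrpowers.spark.daria.sql.DariaWriters

// Plain Spark write: always produces a directory of part files, even with one partition.
df.coalesce(1).write.option("header", "true").csv("mydata.custom")
// -> mydata.custom/part-00000-....csv, _SUCCESS, ...

// spark-daria: stages the part file and renames it so a single output file remains.
DariaWriters.writeSingleFile(
  df = df,
  format = "csv",
  sc = spark.sparkContext,
  tmpFolder = "/tmp/custom-table-staging",
  filename = "mydata.custom")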

import org.example.{CustomPluginBase, CustomTable}
import org.example.proto

class CustomRelationPlugin extends RelationPlugin with CustomPluginBase {
Contributor:

Add a bit of doc as to why this is needed.

Contributor Author:

Added!


import scala.collection.JavaConverters._

class CustomCommandPlugin extends CommandPlugin with CustomPluginBase {
Contributor:

Add a bit of doc here.

Contributor Author:

Added!

@hvanhovell (Contributor) left a comment:

Looks good!

A couple of comments.

Can you make sure this is actually built as part of CI?

@vicennial (Contributor Author):

Thanks for the review @hvanhovell!

> Can you make sure this is actually built as part of CI?

Do you mean to have CI compile this project? I explicitly had this unlinked to the parent Spark POM as it is meant to be 'standalone'

@@ -0,0 +1,133 @@
# Spark Server Library Example - Custom Datasource Handler
Member:

Spark Server -> Spark Connect Server?

* limitations under the License.
*/

package org.example
Member:

It's a little weird to have org.example in an ASF repository.

Technically, we always use the org.apache. prefix.

$ git grep '^package org' | grep -v org.apache | awk '{print $NF}' | sort | uniq -c

import org.apache.spark.connect.proto.Command
import org.example.proto
import org.example.proto.CreateTable.Column.{DataType => ProtoDataType}
import org.apache.spark.sql.{functions, Column, DataFrame, Dataset, Row, SparkSession}
Member:

Apache Spark has an ordering rule for import statements.
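
For reference, Spark's convention groups imports into blocks ordered java/javax, scala, third-party, then org.apache.spark, each block sorted alphabetically. Under that rule the four imports shown above would be regrouped roughly as:

import org.example.proto
import org.example.proto.CreateTable.Column.{DataType => ProtoDataType}

import org.apache.spark.connect.proto.Command
import org.apache.spark.sql.{functions, Column, DataFrame, Dataset, Row, SparkSession}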

import com.google.protobuf.Any
import org.apache.spark.connect.proto.Command

object Main {
Member:

Shall we have some meaningful name instead of org.example.Main?

  <parent>
    <groupId>org.example</groupId>
    <artifactId>spark-server-library-example</artifactId>
    <version>1.0-SNAPSHOT</version>
Member:

?

2,Jane Smith
3,Bob Johnson
4,Alice Williams
5,Charlie Brown
Member:

Shall we use simply data.csv as a file name because this seems to be technically CSV format?

@@ -139,3 +139,4 @@ core/src/main/resources/org/apache/spark/ui/static/package.json
testCommitLog
.*\.har
.nojekyll
dummy_data.custom
Member:

FYI, if you use .csv or .data extension for this new file, we don't need to touch this file.

@dongjoon-hyun (Member) left a comment:

This looks like an integration test suite to me, but I understand this will be used as documentation. I left a few initial comments from the ASF code perspective.
