Connecting to Spark query server by using Spark JDBC Driver
Applies to: Spark engine
You can connect to the Spark query server in the following ways and execute queries to analyze your data.
Before you begin
- Install watsonx.data.
- Provision a native Spark engine in watsonx.data.
- JDBC driver: queryserver-jdbc-4.1.0-SNAPSHOT-standalone.jar. Contact the IBM Support team to get the file.
- Run the Spark query server in the Spark engine. To create a new query server, see Create a Spark query server.
- Connection properties: Click the three-dot menu for the query server, click Connection Details, and copy the following connection details:
- Host
- URI
- Instance
- Username
- Your IBM IAM API key.
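These values map onto the placeholders in the examples that follow: the Host and URI appear in the JDBC URL, the Instance value is passed as the instance property, and the username and IAM API key are used as the sign-in credentials.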
Connecting to Spark query server by using DBeaver (JDBC client)
To connect to the Spark query server by using a JDBC client, such as DBeaver, set up the watsonx.data driver in DBeaver.
- Open DBeaver and in the menu bar click Database > Driver Manager.
- Search for Hive. You can find the Apache Hive 4+ driver under the Hadoop category.
- Click Copy.
- Change the name to Spark watsonx.data.
- Change the following settings:
  - In the Settings tab, set the driver class name to com.ibm.wxd.spark.jdbc.QueryServerDriver (the class that is also loaded in the Java example below).
  - In the Libraries tab, add the Spark JDBC query server JAR file.
- Select Database Navigator, click New Connection, and complete the following steps:
  - Select the newly created driver.
  - Click Connect by and select URL.
  - Provide the JDBC URL in the following format (a filled-in example follows this list):
    jdbc:hive2://<HOST>:443/default;instance=<INSTANCE>;httpPath=<URI>
  - Select Authentication, and enter your username for Username and your IAM API key as the password.
  - Save the connection and connect to it by double-clicking.
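For example, with the placeholder values that the Java program below uses, the URL looks like the following (substitute your own host, instance, and URI):
jdbc:hive2://example.com:443/default;instance=CRN/OR/INSTANCE-ID;httpPath=/lakehouse/api/v2/spark_engines/.../query_servers/.../connect/cliservice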
Connecting to Spark query server by using Java (JDBC client) code
Ensure your Java CLASSPATH includes the downloaded JDBC driver. For example:
java -cp queryserver-jdbc-4.1.0-SNAPSHOT-standalone.jar App.java
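This command uses Java's source-file launch mode (Java 11 and later). Alternatively, you can compile the class first and run it with both the driver JAR and the current directory on the classpath (a sketch for Linux and macOS; on Windows, use ; as the classpath separator):
javac -cp queryserver-jdbc-4.1.0-SNAPSHOT-standalone.jar App.java
java -cp queryserver-jdbc-4.1.0-SNAPSHOT-standalone.jar:. App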
Set the parameters at the top of the following Java code and use it to connect to the Spark query server.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class App {
    public static void main(String[] args) throws Exception {
        // Set the following values from the Connection Details of the query server.
        // Exclude the https/http/www prefix; use the domain only.
        String host = "example.com";
        String instance = "CRN/OR/INSTANCE-ID";
        String uri = "/lakehouse/api/v2/spark_engines/.../query_servers/.../connect/cliservice";
        String user = "EMAIL-ID/OR/USER-ID";
        String apikey = "API-KEY";
        String jdbcUrl = String.format("jdbc:hive2://%s/default;instance=%s;httpPath=%s;", host, instance, uri);
        // Required only if your domain requires SSL certificates.
        // This is not required for SaaS; comment out the following line for SaaS.
        // Otherwise, provide the path to a truststore that contains the SSL certificates for the host.
        jdbcUrl += "sslTrustStore=tech_trust.jks;trustStorePassword=Test@123";
        try {
            // Load the watsonx.data Spark query server JDBC driver
            Class.forName("com.ibm.wxd.spark.jdbc.QueryServerDriver");
            // Connect to the query server
            Connection con = DriverManager.getConnection(jdbcUrl, user, apikey);
            Statement stmt = con.createStatement();
            System.out.println("Connected to watsonx.data Spark Query Server");
            // Sample query
            String sql = "show databases";
            ResultSet rs = stmt.executeQuery(sql);
            ResultSetMetaData rsmd = rs.getMetaData();
            int columnCount = rsmd.getColumnCount();
            // Column indexes start from 1
            for (int i = 1; i <= columnCount; i++) {
                System.out.println(rsmd.getColumnName(i));
            }
            // Print the result
            while (rs.next()) {
                System.out.println(rs.getString(1)); // Or loop through columns
            }
            // Clean up
            rs.close();
            stmt.close();
            con.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
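The sample prints only the first column of each row. To print every column, you can reuse columnCount from the result set metadata; the following loop is a drop-in replacement for the result-printing loop in the example above:

// Print all columns of each row, tab-separated
while (rs.next()) {
    StringBuilder row = new StringBuilder();
    for (int i = 1; i <= columnCount; i++) {
        if (i > 1) row.append("\t");
        row.append(rs.getString(i));
    }
    System.out.println(row);
}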
Connecting to Spark query server by using Python (PyHive client)
To connect to the Spark query server by using a Python program, complete the following steps:
- Ensure that you have Python version 3.12 or earlier.
- Install PyHive by using pip install "PyHive[hive_pure_sasl]==0.7.0".
- Save the following code in a file, for example connect.py:

import ssl
import base64
import requests
import thrift.transport.THttpClient
from pyhive import hive

# Change the following inputs
class Credentials:
    host = "https://example.ibm.com"
    uri = "/lakehouse/api/v2/spark_engines/.../query_servers/.../connect/cliservice"
    instance_id = "CRN/OR/INSTANCE-ID"
    username = "EMAIL-ID/OR/USER-ID"
    apikey = "API-KEY"

creds = Credentials()

def disable_ssl(ctx):
    # Disables certificate verification; use only when the host has no valid SSL certificate
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    ssl.SSLContext.verify_mode = property(lambda self: ssl.CERT_NONE, lambda self, newval: None)

def get_access_token(apikey):
    # Exchanges an IBM Cloud IAM API key for a bearer token
    try:
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded',
            'Accept': 'application/json',
        }
        data = {
            'grant_type': 'urn:ibm:params:oauth:grant-type:apikey',
            'apikey': apikey,
        }
        response = requests.post('https://iam.cloud.ibm.com/identity/token', headers=headers, data=data)
        return response.json()['access_token']
    except Exception as inst:
        print('Error in getting access token')
        print(inst)
        exit(1)

ctx = ssl.create_default_context()
## If you need to disable SSL verification, uncomment the following line
# disable_ssl(ctx)

transport = thrift.transport.THttpClient.THttpClient(
    uri_or_host="{host}:{port}{uri}".format(
        host=creds.host,
        uri=creds.uri,
        port=443,
    ),
    ssl_context=ctx,
)

headers = {"AuthInstanceId": creds.instance_id}
if creds.instance_id.isdigit():
    # Software installation
    headers["Authorization"] = "ZenApiKey " + base64.b64encode(f"{creds.username}:{creds.apikey}".encode('utf-8')).decode('utf-8')
else:
    # Cloud installation
    headers["Authorization"] = "Bearer {}".format(get_access_token(creds.apikey))
transport.setCustomHeaders(headers)

cursor = hive.connect(thrift_transport=transport).cursor()
print("Connected to Spark Query Server")
cursor.execute('show databases')
print(cursor.fetchall())
cursor.close()

- Run the script by using python connect.py.
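The script chooses the Authorization header based on the instance ID: a purely numeric ID indicates a software installation and sends a ZenApiKey credential built from your username and API key, while a CRN indicates a cloud installation and sends a bearer token obtained from IAM with your API key.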