Standard Spark examples
You can use the following code samples to learn how to use Spark in different situations.
To understand how to access Object Storage, see Understanding the Object Storage credentials.
Reading a CSV file from Object Storage using previously specified credentials
The following code samples show you how to create a Python script that reads data from a CSV file to a Python DataFrame. Both the Python script and the CSV file are located in Object Storage.
To read from Object Storage within the application, you can use the same IBM Cloud Object Storage credentials that you specified when you submitted the Spark application, or that were set as a default configuration when you created the IBM Analytics Engine service instance.
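For example, if default Object Storage credentials were stored on the instance when it was created, the submission payload can in principle omit the conf block entirely. This is a hypothetical sketch; the bucket and service names are placeholders:

{
  "application_details": {
    "application": "cos://cosbucketname.cosservicename/read-employees.py"
  }
}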
Example of the application called read-employees.py. Insert the Object Storage bucket name and service name; the service name is any name given to your Object Storage instance:
from pyspark.sql import SparkSession

def init_spark():
    spark = SparkSession.builder.appName("read-write-cos-test").getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def read_employees(spark, sc):
    print("Hello 1", spark)
    employeesDF = spark.read.option("header", True).csv("cos://cosbucketname.cosservicename/employees.csv")
    print("Hello 2", employeesDF)
    employeesDF.createOrReplaceTempView("empTable")
    seniors = spark.sql("SELECT empTable.NAME FROM empTable WHERE empTable.BAND >= 6")
    print("Hello 3", seniors)
    seniors.show()

def main():
    spark, sc = init_spark()
    read_employees(spark, sc)

if __name__ == '__main__':
    main()
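Given the sample employees.csv shown next, the seniors.show() call selects the three employees with BAND >= 6 and should print output along these lines:

+-------+
|   NAME|
+-------+
|Abraham|
| Barack|
|Kennedy|
+-------+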
Example of the CSV file called employees.csv that is read by the application called read-employees.py:
NAME,BAND,DEPT
Abraham,8,EG
Barack,6,RT
Clinton,5,GG
Hoover,4,FF
Kennedy,7,NN
Truman,3,TT
To run the read-employees.py application that reads data from employees.csv, POST the following JSON payload, called read-employees-submit.json. Insert the Object Storage bucket and service name where the CSV file is located, modify the endpoint path, and insert your access key and secret key:
{
  "application_details": {
    "application": "cos://cosbucketname.cosservicename/read-employees.py",
    "conf": {
      "spark.hadoop.fs.cos.cosservicename.endpoint": "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud/<CHANGEME-according-to-instance>",
      "spark.hadoop.fs.cos.cosservicename.access.key": "<CHANGEME>",
      "spark.hadoop.fs.cos.cosservicename.secret.key": "<CHANGEME>"
    }
  }
}
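As a sketch of how the payload might be POSTed, the following assumes the IBM Analytics Engine serverless v3 REST API and a valid IAM bearer token; the instance GUID, region, and token are placeholders, not values from this document:

import json
import requests

# Placeholders: replace with your instance GUID, region, and IAM bearer token.
INSTANCE_ID = "<CHANGEME-instance-guid>"
REGION = "us-south"
IAM_TOKEN = "<CHANGEME-bearer-token>"

# Assumed endpoint shape for the Analytics Engine serverless v3 API.
url = (f"https://api.{REGION}.ae.cloud.ibm.com"
       f"/v3/analytics_engines/{INSTANCE_ID}/spark_applications")

# Load the payload prepared above (read-employees-submit.json).
with open("read-employees-submit.json") as f:
    payload = json.load(f)

response = requests.post(
    url,
    headers={
        "Authorization": f"Bearer {IAM_TOKEN}",
        "Content-Type": "application/json",
    },
    json=payload,
)
print(response.status_code, response.json())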
Reading a CSV file from Object Storage using an IAM API key or different HMAC credentials
The following code samples show you how to create a Python script that reads data from a CSV file to a Python DataFrame. Both the Python script and the CSV file are located in Object Storage.
This example shows you how to access IBM Cloud Object Storage using the IAM API key.
Example of the application called read-employees-iam-key-cos.py. Insert the Object Storage bucket name where the CSV file is located and modify the endpoint path:
from pyspark.sql import SparkSession

def init_spark():
    spark = SparkSession.builder.appName("read-write-cos-test").getOrCreate()
    sc = spark.sparkContext
    # Configure a separate service name ("testcos") inside the application,
    # authenticated with an IAM API key. Change the endpoint according to
    # your instance.
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set("fs.cos.testcos.endpoint", "s3.direct.us-south.cloud-object-storage.appdomain.cloud")
    hconf.set("fs.cos.testcos.iam.api.key", "<CHANGEME>")
    return spark, sc

def read_employees(spark, sc):
    print("Hello 1", spark)
    # The service name in the path must match the configured prefix above.
    employeesDF = spark.read.option("header", True).csv("cos://cosbucketname.testcos/employees.csv")
    print("Hello 2", employeesDF)
    employeesDF.createOrReplaceTempView("empTable")
    juniors = spark.sql("SELECT empTable.NAME FROM empTable WHERE empTable.BAND < 6")
    print("Hello 3", juniors)
    juniors.show()

def main():
    spark, sc = init_spark()
    read_employees(spark, sc)

if __name__ == '__main__':
    main()
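Given the same employees.csv, the juniors.show() call selects the employees with BAND < 6 and should print output along these lines:

+-------+
|   NAME|
+-------+
|Clinton|
| Hoover|
| Truman|
+-------+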
Then POST the following JSON payload, called read-employees-iam-key-cos-submit.json, with your access key and secret key. Insert the Object Storage bucket and service name where the Python script is located and modify the endpoint path:
{
  "application_details": {
    "application": "cos://cosbucketname.cosservicename/read-employees-iam-key-cos.py",
    "conf": {
      "spark.hadoop.fs.cos.cosservicename.endpoint": "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud/<CHANGEME-according-to-instance>",
      "spark.hadoop.fs.cos.cosservicename.access.key": "<CHANGEME>",
      "spark.hadoop.fs.cos.cosservicename.secret.key": "<CHANGEME>"
    }
  }
}
Reading and writing to Object Storage using IAM in Scala
- Create an Eclipse project with the layout sketched below.
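  One plausible layout, assuming the file and package names used in the following steps:

  SparkScalaExample/
    SparkScalaExample.sbt
    src/
      main/
        scala/
          com/
            ibm/
              analyticsengine/
                ScalaReadWriteIAMCOSExample.scala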
- Add the following application called ScalaReadWriteIAMCOSExample.scala to the analyticsengine folder. Insert your IAM API key and the Object Storage bucket name:

  package com.ibm.analyticsengine

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._

  object ScalaReadWriteIAMCOSExample {
    def main(args: Array[String]) {
      val sparkConf = new SparkConf().setAppName("Spark Scala Example")
      val sc = new SparkContext(sparkConf)

      val prefix = "fs.cos.cosservicename"
      sc.hadoopConfiguration.set(prefix + ".endpoint", "s3.direct.us-south.cloud-object-storage.appdomain.cloud")
      sc.hadoopConfiguration.set(prefix + ".iam.api.key", "<CHANGEME>")

      val data = Array("Sweden", "England", "France", "Tokyo")
      val myData = sc.parallelize(data)
      myData.saveAsTextFile("cos://cosbucketname.cosservicename/2021sep25-1.data")

      val myRDD = sc.textFile("cos://cosbucketname.cosservicename/2021sep25-1.data")
      myRDD.collect().foreach(println)
    }
  }
- Put SparkScalaExample.sbt in the SparkScalaExample project root folder (sbt reads the build definition from there):

  name := "ScalaReadWriteIAMCOSExample"
  version := "1.0"
  scalaVersion := "2.12.10"
  libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2"
- Build the Scala project using sbt:

  cd SparkScalaExample
  sbt package

  Then upload the resulting jar (scalareadwriteiamcosexample_2.12-1.0.jar) to Object Storage.
- Then POST the following JSON payload, called read-write-cos-scala-submit.json, with your access key and secret key. Insert the Object Storage bucket and service name where the jar file is located, and modify the endpoint path:

  {
    "application_details": {
      "application": "cos://cosbucketname.cosservicename/scalareadwriteiamcosexample_2.12-1.0.jar",
      "class": "com.ibm.analyticsengine.ScalaReadWriteIAMCOSExample",
      "conf": {
        "spark.hadoop.fs.cos.cosservicename.endpoint": "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud/<CHANGEME-according-to-instance>",
        "spark.hadoop.fs.cos.cosservicename.access.key": "<CHANGEME>",
        "spark.hadoop.fs.cos.cosservicename.secret.key": "<CHANGEME>"
      }
    }
  }
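If the job runs successfully, the final foreach(println) writes the four strings to the driver log, along these lines (the ordering can vary with partitioning):

Sweden
England
France
Tokyo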
Submitting application details with an IAM API key
The following example shows the application details for the application called read-employees.py that reads data from employees.csv in Object Storage using the IAM API key. Insert the Object Storage bucket and service name where the CSV file is located, and insert your API key:
{
  "application_details": {
    "application": "cos://cosbucketname.cosservicename/read-employees.py",
    "conf": {
      "spark.hadoop.fs.cos.cosservicename.endpoint": "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud/<CHANGEME-according-to-instance>",
      "spark.hadoop.fs.cos.cosservicename.iam.api.key": "<CHANGEME>"
    }
  }
}