Using cloudyr for data science
When you use the R programming language for your projects, you can get the most out of the data science features of IBM Cloud® Object Storage by using cloudyr.
This tutorial shows you how to integrate data from the IBM Cloud® Platform within your R project. Your project uses IBM Cloud Object Storage for storage, with S3-compatible connectivity.
Before you begin
Make sure that you have the following prerequisites before continuing:
- IBM Cloud Platform account
- An instance of IBM Cloud Object Storage
- `R` installed and configured
- S3-compatible authentication configuration
Create HMAC credentials
Before we begin, we might need to create a set of HMAC credentials as part of a Service Credential, by passing the configuration parameter `{"HMAC":true}` when the credentials are created. For example, use the IBM Cloud CLI as shown here.
ibmcloud resource service-key-create <key-name-without-spaces> Writer --instance-name "<instance name--use quotes if your instance name has spaces>" --parameters '{"HMAC":true}'
To store the results of the generated key, append `> cos_credentials` to the end of the command in the example. For the purposes of this tutorial, you need to find the `cos_hmac_keys` heading with its child keys, `access_key_id` and `secret_access_key`.
cos_hmac_keys:
access_key_id: 7xxxxxxxxxxxxxxa6440da12685eee02
secret_access_key: 8xxxx8ed850cddbece407xxxxxxxxxxxxxx43r2d2586
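Putting these steps together, the redirected command might look like the following sketch; the key name and instance name are placeholders, so substitute your own:

```shell
# Create HMAC credentials and save the output to a local file (placeholder names)
ibmcloud resource service-key-create my-cos-hmac-key Writer \
  --instance-name "My COS Instance" \
  --parameters '{"HMAC":true}' > cos_credentials
```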
While it is best practice to set credentials in environment variables, you can also set your credentials inside your local copy of the R script itself. Environment variables can alternatively be set before you start R by using an `Renviron.site` or `.Renviron` file, which sets environment variables during R startup.
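For example, a `.Renviron` file might look like the following sketch. The endpoint shown is an assumption for a Regional us-south instance (check the endpoints listed for your own instance), and the key values are the placeholders from the credentials above:

```
AWS_ACCESS_KEY_ID=7xxxxxxxxxxxxxxa6440da12685eee02
AWS_SECRET_ACCESS_KEY=8xxxx8ed850cddbece407xxxxxxxxxxxxxx43r2d2586
AWS_S3_ENDPOINT=s3.us-south.cloud-object-storage.appdomain.cloud
AWS_DEFAULT_REGION=
```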
You will need to set the actual values for the `access_key_id` and `secret_access_key` in your code, along with the IBM Cloud Object Storage endpoint for your instance.
Add credentials to your R project
Installing the R language and its suite of applications is beyond the scope of this tutorial, so it is assumed that you have already done so. Before you add any libraries or code to your project, ensure that you have credentials available to connect to IBM Cloud Object Storage. You will need the appropriate region for your bucket and endpoint.
Sys.setenv("AWS_ACCESS_KEY_ID" = "access_key_id",
"AWS_SECRET_ACCESS_KEY" = "secret_access_key",
"AWS_S3_ENDPOINT" = "myendpoint",
"AWS_DEFAULT_REGION" = "")
Add libraries to your R project
We use a cloudyr S3-compatible client to test our credentials by listing your buckets. Additional packages come from CRAN, the Comprehensive R Archive Network, which operates through a series of mirrors. For this example, we use the `aws.s3` package, added to the code that sets or accesses your credentials.
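If the package is not yet installed in your environment, you can first fetch it from CRAN; a minimal sketch:

```r
# One-time setup: install the aws.s3 package from a CRAN mirror
install.packages("aws.s3", repos = "https://cloud.r-project.org")
```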
library("aws.s3")
bucketlist()
Use library methods in your R project
You can learn a lot from working with sample packages. For example, the package for Cosmic Microwave Background Data Analysis presents a conundrum: the project's executables are small enough to compile and run on a personal machine, but working with the source data is constrained by the size of the data.
When you use version 0.3.21 of the `aws.s3` package, it is necessary to add `region=""` to a request to connect to COS.
In addition to PUT, HEAD, and other compatible API operations, we can GET objects, as shown here with the S3-compatible client that we included earlier.
# return object using 'S3 URI' syntax, with progress bar
get_object("s3://mybucketname-only/example.csv", show_progress = TRUE)
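For instance, the same client also offers `put_object()` and `head_object()`. The following sketch assumes the same hypothetical bucket name and a local file named example.csv, and passes the empty region that COS requires:

```r
library("aws.s3")

# Upload a local file to the bucket (PUT); region="" is required for COS
put_object(file = "example.csv", object = "example.csv",
           bucket = "mybucketname-only", region = "")

# Check that the object now exists (HEAD); returns TRUE or FALSE
head_object("example.csv", bucket = "mybucketname-only", region = "")
```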
Add data to your R project
As you might guess, the library discussed earlier has a `save_object()` method that saves an object from your bucket directly to a local file. While there are many ways to load data, we can use cloudSimplifieR to work with an open data set.
library(cloudSimplifieR)
d <- as.data.frame(csvToDataframe("s3://mybucket/example.csv"))
plot(d)
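If you prefer to stay within `aws.s3`, a similar result can be sketched with its `s3read_using()` helper, assuming the same hypothetical bucket and the credentials set earlier:

```r
library("aws.s3")

# Fetch the CSV from COS and parse it with read.csv in one step;
# the empty region is passed through opts for COS compatibility
d <- s3read_using(FUN = read.csv, object = "s3://mybucket/example.csv",
                  opts = list(region = ""))
plot(d)
```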
Next steps
In addition to creating your own projects, you can also use RStudio to analyze data.