Julia on Google Compute Engine: working with files

This is the second part of the Julia on Google Compute Engine (GCE) series that I started a few weeks back. The first entry addressed how to set up Julia on a standard GCE instance, and how to run simple scripts. Today, we'll be doing less with Julia, and more with setting things up so you can efficiently write Julia programs and scripts, and make it easy to get the results of your computations.

Anatomy of a setup

The usual disclaimer applies here: the optimal setup depends largely on personal preference and what you're trying to accomplish. My general process is to write locally, validate against small data sets, then run on a high-powered instance with the output going to an arbitrary object store.

Your process and setup may vary, but it's likely that you'll want to access files on your virtual machine, and make your results accessible outside of the machine. The easiest way to do this is to use Google Cloud Storage (GCS), a highly available arbitrary object store. Files written to GCS can be public, or ACL'd to a specific set of users and/or applications.

Let's explore using Google Cloud Storage within our Julia program.

Creating a bucket on Cloud Storage

After installing gsutil or the Cloud SDK and configuring it to work with your project, you're set to start using the command line. The first step to using GCS is to create a bucket (or two) for your files. It may be helpful to separate your scripts from your results. You can do this using the command line, or using the Developers Console. Let's create one bucket:

 $ gsutil mb gs://<bucket-name>  

You can also do this from the Developers Console, by navigating to Cloud Storage, clicking on New Bucket, and entering your bucket name.

Copying files from Cloud Storage to your instance

When you created your instance (whether using the command line, or using the Developers Console), you gave it permission to use GCS by creating it with read/write access. This was accomplished by using service accounts. Service accounts are scoped the same way that people are, except that they represent software instead of a person. They can be used to authenticate to other Google services, in this case Cloud Storage. Our service account was given access to read files from and write files to GCS.

This means that grabbing files from Cloud Storage is as easy as using gsutil while ssh'd into your instance:

 $ gsutil cp gs://<bucket-name>/<filename> .  

Using data files from Cloud Storage within Julia, part 1

That's useful, but it still requires a manual step. Can we do that from inside a Julia program? There's a download function in the standard library, after all. Turns out, no, you can't. That's because the download function shells out to curl or wget, and neither of those knows how to make use of the service account the way gsutil does. However, you can run arbitrary command line tools from within Julia -- command line tools including gsutil.

Let's assume that we have a file, 2x2array.csv, uploaded to GCS within a julia-data folder in our bucket, and we want to use it in our program. We can do this in a couple of steps:

Download the file

 run(`gsutil cp gs://<bucket-name>/julia-data/2x2array.csv .`)  

Read the file

 twoxtwo = readdlm("2x2array.csv", ',')  
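If you fetch more than one file this way, the two steps above can be wrapped in a small helper. This is only a sketch: gsutil_cp_cmd and fetch_from_gcs are hypothetical names of my own, not part of any package, and the bucket and object path in the usage comment are placeholders. It assumes gsutil is on the PATH of the instance, as it is in the steps above.

```julia
# Hypothetical helper: builds the same gsutil command as above
# for any bucket, object path and local destination.
gsutil_cp_cmd(bucket, object_path, dest) = `gsutil cp gs://$bucket/$object_path $dest`

# Hypothetical helper: copies the object into the current directory
# unless a file with the same name is already there, and returns the local name.
function fetch_from_gcs(bucket, object_path)
    local_name = basename(object_path)
    if !isfile(local_name)
        run(gsutil_cp_cmd(bucket, object_path, local_name))
    end
    return local_name
end

# Usage (placeholder bucket name):
#   local_csv = fetch_from_gcs("<bucket-name>", "julia-data/2x2array.csv")
```

Skipping the copy when the file already exists keeps repeated runs of a script from downloading the same data again; delete the local copy if the object in the bucket changes.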

Now, you're ready to manipulate the data you downloaded. That's still a little hacky though, isn't it? Can we do it without gsutil?

Using data files from Cloud Storage within Julia, part 2

gsutil isn't the only program that can use the service account. Our Julia program can as well. We'll be using a package called HTTPClient as well as a JSON Parser to do this. You can learn how to add/remove/otherwise manage packages in the documentation.

 using HTTPClient.HTTPC  
 using JSON  

To be able to access Cloud Storage from within our program, we first need to grab an access token.

 url = "http://metadata/computeMetadata/v1/instance/service-accounts/default/token"  
 request = HTTPC.get(url, RequestOptions(headers=[("X-Google-Metadata-Request", "True")]))  
 access_token = JSON.parse(bytestring(request.body))["access_token"]  

Now that we have our access token, we can fetch our data file(s):

 bucket_name = "<bucket-name>"  
 file_name = "<data-file-name>"  
 project_number = "<project-number>"  
 request_url = string("https://www.googleapis.com/storage/v1beta2/b/", bucket_name, "/o/", file_name, "?alt=media")  
 request = HTTPC.get(request_url, RequestOptions(headers=[("Authorization", string("OAuth ", access_token)),  
                                                            ("x-goog-api-version", "2"),  
                                                            ("x-goog-project-id", project_number)]))  

From here, we can write the file locally:

 fp = open("<temp-file-name>", "w")  
 write(fp, takebuf_string(request.body))  
 close(fp)  # flush the data to disk and release the file handle

A couple of notes:

  • You'll need to replace <bucket-name>, <data-file-name>, <project-number>, and <temp-file-name> with your own information
  • The ?alt=media query parameter indicates that you want to fetch the file's contents rather than its metadata
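One more note on the request URL: if the object name contains a slash (like julia-data/2x2array.csv from part 1), the name has to be percent-encoded before it goes into the URL path. Here is a minimal sketch of building the URL with that in mind. The percent_encode and object_url helpers are hypothetical names of my own, and the code is written against Julia 1.x Base (codeunits, and the base and pad keywords of string), which is newer than the Julia release the other snippets in this post target.

```julia
# Hypothetical helper: percent-encode every byte except the
# RFC 3986 "unreserved" characters (ASCII letters, digits, and - . _ ~).
function percent_encode(s::AbstractString)
    io = IOBuffer()
    for b in codeunits(s)
        c = Char(b)
        if b < 0x80 && (isletter(c) || isdigit(c) || c in "-._~")
            write(io, b)
        else
            print(io, '%', uppercase(string(b, base=16, pad=2)))
        end
    end
    return String(take!(io))
end

# Hypothetical helper: same URL shape as the request above,
# with the object name encoded.
object_url(bucket, name) = string("https://www.googleapis.com/storage/v1beta2/b/",
                                  bucket, "/o/", percent_encode(name), "?alt=media")

# "my-bucket" is a placeholder bucket name.
println(object_url("my-bucket", "julia-data/2x2array.csv"))
# prints https://www.googleapis.com/storage/v1beta2/b/my-bucket/o/julia-data%2F2x2array.csv?alt=media
```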

You can now manipulate data fetched from GCS directly within your program.

Pushing files to Cloud Storage within Julia

No matter what you do with Julia, you'll likely produce some sort of output file -- whether it be raw numeric data, or plots, or images. So, where do you stash the results? Considering that you've already got a bucket and access token, you probably want to write directly to GCS. Let's take a look at an example, visualizing components of a classic data set with Gadfly.

Add and load the packages:

 Pkg.add("Gadfly")  
 Pkg.add("RDatasets")
 using Gadfly  
 using RDatasets

Load up the wine data into a data frame:

 wine_data = readtable("wine.csv")  

Visualize it with Gadfly and export to png:

 alcohol_vs_magnesium = plot(wine_data, x="Alcohol", y="Magnesium", Geom.point)  
 png_name = "alcohol_vs_magnesium.png"  
 draw(PNG(png_name, 6inch, 3inch),alcohol_vs_magnesium)  

Get the length of the resulting file:

 file_len = strip(readall(`stat -c %s $png_name`))  
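If you'd rather not shell out to stat (the -c flag is specific to GNU stat, so the line above assumes a Linux instance), Julia's built-in filesize function gives the size in bytes directly. A small self-contained sketch, using a throwaway file in place of the PNG:

```julia
# Write a small throwaway file so the example stands on its own;
# in the post this would be the PNG produced by Gadfly.
path = joinpath(mktempdir(), "example.bin")
open(path, "w") do io
    write(io, "hello")
end

# filesize returns the size in bytes as an integer.
file_len = filesize(path)
println(file_len)
```

Since the result is an integer rather than a string, it still works with the string(file_len) call used for the Content-Length header.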

Upload it to GCS:

 url = string("https://www.googleapis.com/upload/storage/v1beta2/b/", bucket_name, "/o?uploadType=media&name=",png_name)  
 options = RequestOptions(headers=[("Authorization", string("OAuth ", access_token)),("Content-Type","image/png"),("Content-Length", string(file_len))])  
 request = HTTPC.post(url, (:file, png_name), options)  
 println(request.http_code)  

If the HTTP status code is 200, then all went according to plan, and your image should now be in your specified bucket.

Fair warning: I don't attest to the significance of this chart.

Wiring up the pieces

This blog entry only really addresses getting data in and out of Julia, not the details within. However, as Julia has no client library for GCS (or any Google API, for that matter), these instructions may help folks. I haven't decided what the next entry will cover, but I'm hoping to get more into actual Julia-specific code.

For those who want to see a more complete picture of what we did, here's the complete code. Please note that you will have to enter some information, such as bucket name, project number, etc. You will also need to download the wine data set and add labels to the CSV file, if you plan on using it. However, there are plenty of fantastic data sets within the RDatasets package, so consider playing around with them!
