Cost Effective Rendering with Amazon SQS and S3

Even though RealityServer is great at streaming fully interactive, server-side rendering directly to your browser, not every use case requires this level of interactivity. RealityServer has recently introduced a new feature called the Queue Manager which integrates with popular message queue services to manage rendering and other RealityServer tasks. In this article we will dive into the details on how to get up and running with this great new feature using Amazon SQS and S3 services.

Queue Manager

Why Queuing?

RealityServer will normally run all requests it is sent in parallel. Often this is exactly what you want, particularly in a multi-user or collaborative application. However, what about use cases where you don’t need full interactivity or live streaming and want to carefully manage your spend on GPU server resources? Or cases where you expect more simultaneous requests than can be handled at once?

The normal solution to this problem is queuing. You implement a job queue in front of RealityServer that takes in those jobs and then runs them in a controlled manner instead of all at once. This is such a common practice in web applications (e.g., for tasks like resizing images or transcoding video) that various services have sprung up to help with this.

Enter the message queue. Basically a message queue allows two programs to interact in an asynchronous manner. In our case, jobs are submitted to the queue (either by RealityServer itself or externally) and then at some later time they are retrieved by a listening RealityServer to execute. Since each listening RealityServer will only process a single job at a time, the queue effectively controls the flow of jobs to RealityServer.
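
To make the producer/consumer interaction concrete, here is a minimal Python sketch using boto3 (the AWS SDK). The queue URL and the job payload schema are illustrative placeholders, not RealityServer’s actual message format:

```python
import json

# Placeholder queue URL; substitute your own SQS queue.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"

def make_job(command, parameters):
    """Serialise a minimal job payload (illustrative, not the real schema)."""
    return json.dumps({"command": command, "parameters": parameters})

def submit_job(sqs, body):
    """Producer side: place a job on the queue."""
    return sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)

def run_consumer_once(sqs):
    """Consumer side: pull at most one job, long-polling for up to 20 s,
    deleting it only after the work is done so failures can be retried."""
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for message in response.get("Messages", []):
        job = json.loads(message["Body"])
        print("would execute:", job["command"])  # a real listener runs the job here
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=message["ReceiptHandle"])
```

A real listener would call run_consumer_once in a loop with a boto3.client("sqs") instance; deleting the message only after successful execution is what lets the queue re-deliver failed jobs to another listener.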

This also makes scaling out trivial. You just add more listening RealityServer instances and they will automatically pull jobs from the queue that others are not working on. This makes job distribution a piece of cake. All you need to do is start another RealityServer if you want to get through the jobs quicker.

In this article we are going to talk specifically about using Amazon SQS as the underlying message queue and Amazon S3 for storing the results. If you have a particular message queue or storage platform you’d like to see supported let us know.

Amazon SQS

Amazon SQS or Simple Queue Service is a robust implementation of a message queue, provided as a turn-key service by Amazon Web Services (AWS). This is the first external queuing service we are offering support for in RealityServer. We’ve implemented the Queue Manager in a modular way and anticipate adding additional integrations over time.

SQS is easy to set up directly from the AWS web-based console (though you can also do it through the AWS API if needed). We provide some instructions on setting up your SQS queues for use with RealityServer here. The rest of this guide assumes you have already set up your queue.
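
For those who prefer the API route, a queue can also be created programmatically. Here is a hedged sketch using boto3; the queue name, visibility timeout and polling interval are example choices, not RealityServer requirements (you would normally pick a visibility timeout longer than your longest expected render):

```python
def queue_attributes(visibility_timeout_s=3600, long_poll_s=20):
    """Build the attribute map for create_queue; values are example choices."""
    return {
        # How long a received job stays invisible before SQS re-delivers it.
        "VisibilityTimeout": str(visibility_timeout_s),
        # Enable long polling so listeners don't busy-wait.
        "ReceiveMessageWaitTimeSeconds": str(long_poll_s),
    }

def create_queue(name):
    """Create the queue and return its URL (requires AWS credentials)."""
    import boto3  # deferred so the snippet can be read without the SDK installed
    sqs = boto3.client("sqs")
    response = sqs.create_queue(QueueName=name, Attributes=queue_attributes())
    return response["QueueUrl"]
```

The returned queue URL is exactly what goes into the queue_url configuration directive covered below.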

Amazon S3

It’s very likely you’ve already heard of Amazon S3 or at least downloaded files hosted there. S3 is a service for storing objects in the cloud. You can simply think of it as a place to store files. A big advantage of S3 is that you can make these files immediately accessible through a simple URL, without the need for a separate web server. You can also integrate S3 storage with Content Delivery Networks like Amazon CloudFront.
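
As a rough sketch of how direct that access is, the following Python snippet uploads a file with boto3 and constructs the virtual-hosted-style URL it becomes reachable at. The bucket, region and file path are placeholders, and the object must be publicly readable for the URL to resolve:

```python
def public_url(bucket, key, region="us-east-1"):
    """Virtual-hosted-style URL for an object; assumes public-read access."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

def upload_public(bucket, key, path):
    """Upload a local file and return the URL it is now served from."""
    import boto3  # deferred so the URL helper is usable without the SDK
    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key, ExtraArgs={"ACL": "public-read"})
    return public_url(bucket, key)
```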

In this article we’ll use S3 as a place to store the results of the jobs we are going to queue and run. We also provide a generic HTTP Postback task you can use if S3 isn’t to your liking but we’ll stick with the AWS theme here.

Terminology

Let’s get a few basic terms out of the way before we jump into setting this up. Since some of these are quite similar and easy to confuse (particularly job and task) we thought we better make them clear.

Queue Module

The queue module provides access to a specific type of queue platform (e.g., Amazon SQS). You’ll configure a particular queue module depending on what service you’re using. Right now we’ll just cover SQS.

Queue

A queue is the place where jobs are stored. The queue manages the life cycle of a job (see below). The queue is externally hosted and not part of RealityServer.

Command

The RealityServer command that will be called by a job to perform the actual work required. Jobs call a single command and this is just a regular RealityServer command (e.g., a custom V8 command you have written).

Parameters

When calling a command in a job you usually will want to pass in parameters to the command, just like when calling a command directly. The parameters are included in the job and also stored in the queue.

Task

The action you want to be performed after a queued job completes. This might be something like uploading an image rendered by the job to a server.

Job

A job is a combination of a command, parameters and tasks. The job is what is inserted into the queue and what gets pulled off the queue when work is to be done.

Configuration

Most of the work required to get going with the queue manager is configuration. The queue manager and the AWS services we support have a set of new configuration directives you can use in realityserver.conf to control how they work. For these examples we’re going to make the following assumptions; obviously you’ll need to modify them for your configuration.

Queue Manager

The queue manager is configured using the queue_manager user directive. Here is a quick example.

<user queue_manager>
    allow_command rs_fancy_render
    allow_command get_version
    default_queue rs_sqs_queue
    listen_queue rs_sqs_queue
</user>

The allow_command directive tells RealityServer which commands can actually be queued as a job. You can allow multiple commands, but each job can only call a single command. For that reason the most likely scenario is that you would write a custom V8 command which performs all the actions you want a job to take. Above we allow a custom V8 command called rs_fancy_render as well as the built-in get_version command.

When submitting jobs using the queue_manager_submit_job command (more on this later) you don’t have to specify a queue name; if you don’t, the queue specified by default_queue will be used.

By default RealityServer will not listen to any queues and so will not receive any jobs or perform any work. If you just want to submit jobs this is fine, but if you also want to have RealityServer run jobs you need to add the listen_queue directive. The listen queue and default queue don’t need to be the same, though in many use cases they will be.

So we reference a queue named rs_sqs_queue in two places in the configuration above, but where is that defined? The actual queues are defined in directives for the specific queue implementation being used; in our case that falls under a specific AWS configuration. Let’s set that up now.

AWS

In our example the AWS configuration is going to set up both the SQS based queue and the S3 based storage for our upload tasks. Right now SQS and S3 are only used with the queue manager, but we intend to expose them to other parts of RealityServer down the road, and so have put the AWS configuration into its own set of directives. Here is an example AWS configuration.

<user aws>
    access_key PGRM8R6PU2JTH6ATO72O
    secret_key P+gVZOPXBsYCzmZJxaduo/lF/nLvGcrCZNu/ReAJ
    <user sqs>
        name rs_sqs_queue
        queue_url https://sqs.us-east-1.amazonaws.com/597971865187/RS_jobs
    </user>
    <user s3>
        <user bucket>
            name rs_s3_bucket
            bucket rs_example_bucket
            region us-east-1
        </user>
    </user>
</user>

Those may look like a real AWS key and secret, but you’ll be disappointed if you try them; of course we are not that silly, they are just random examples.

As you can see there is an enclosing aws user directive for all AWS related functionality. You need to provide your access_key and secret_key which can be obtained or generated from your IAM console. These allow RealityServer to talk to your AWS account to use the services. You can restrict the access credentials to only allow access to the specific services you need and we definitely recommend doing so.

In the nested sqs user directive you’ll find where our named queue is defined. All that is needed is a name and queue_url. The name is what is used elsewhere in the configuration to refer to the queue, and can be whatever you choose. The queue URL is obtained from the SQS console or API and is what uniquely identifies the queue (yes, the above is also just a random example). You can define multiple queues in your configuration if you want to be able to submit jobs to many queues; however, you can only listen to a single queue at a time.

Another nested user directive is added for s3, which allows us to define an S3 bucket where job results will be uploaded by the s3_upload task (more on that later). Here you need to provide a name, bucket and region. Like the queue name, the name for the bucket is only used within RealityServer, so you can choose your own. The bucket must correspond to the name of the bucket you created in S3, and since S3 buckets are stored in a specific region you also need to set the region directive.

You can specify multiple buckets if you wish. The buckets are not referenced elsewhere in the configuration but rather used when defining tasks at job submission time. If you want to know more about S3 and are wondering what buckets, regions and keys are, start with the AWS guide here.

Submitting Jobs

Now that you’re set up, restart RealityServer and let’s try submitting a job. The easiest way to do this is with the built-in queue_manager_submit_job command. It is possible to submit jobs directly to the queue by inserting a message with the right payload, but we won’t cover that advanced topic in this post. When getting started we always recommend using the command, since it also validates what you are submitting.

Here is a really simple example which just calls the get_version command and uploads the result to S3. This is the JSON-RPC version of the command but of course you can call it any way you call RealityServer commands (e.g., with the realityserver-client).

{"jsonrpc": "2.0", "method": "queue_manager_submit_job", "params": {
	"command": "get_version",
	"parameters": {},
	"queue_name": "rs_sqs_queue",
	"tasks": [
		{
			"name": "s3_upload",
			"config": {
				"bucket": "rs_s3_bucket",
				"key": "tests/version-${message_id}.${mime_ext}",
				"acl": "public-read"
			}
		}
	]
}, "id": 1}

There is a bit to unwrap there, but it’s all pretty simple. The command and parameters specify what you want to run. In this example there are no parameters, and note that the command is one of those we explicitly allowed in our configuration. We’ve also specified the queue_name to refer to the queue we configured. We could have left that out in our case, since we have a default_queue set in the configuration.

The real meat in the above is the tasks parameter. This is provided as an array of tasks which will be run after the specified command has completed. You can perform multiple tasks, and each needs to specify a name and a config map which contains the settings for the task. Currently the supported task names are s3_upload and http_post.

As we’re sticking with S3 for this article we’ll only cover s3_upload, but you can read about http_post in the documentation. The s3_upload task expects the following keys in its config map: bucket (the named bucket from your configuration), key (the key to store the result object under) and acl (the ACL to apply to the uploaded object).

The key configuration setting deserves a little more explanation. You can see that in our example it includes two string substitutions, for message_id and mime_ext. These will be automatically replaced with the message id given to the job when it was submitted and the extension part of the MIME type of the result data.
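
The substitution behaviour can be illustrated with Python’s string.Template, which happens to use the same ${name} syntax. This merely mimics what the task does; it is not RealityServer’s own code:

```python
import string

def expand_key(template, message_id, mime_ext):
    """Expand ${message_id} and ${mime_ext} the way the s3_upload task does."""
    return string.Template(template).substitute(
        message_id=message_id, mime_ext=mime_ext)

# expand_key("tests/version-${message_id}.${mime_ext}", "abc123", "json")
# yields "tests/version-abc123.json"
```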

You’ll also note that in our example we specify what looks like a directory path at the start of our key. S3 doesn’t really have a concept of directories; all objects are stored directly in a bucket. However, when a key contains forward slash characters, most tools which access S3 (including the AWS tools) treat them like directory separators. So for all intents and purposes you can think of it just like a directory path.
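
Here is a short boto3 sketch of listing the objects under such a pseudo-directory prefix (the bucket name is a placeholder):

```python
def split_key(key):
    """Split an S3 key into its pseudo-directory parts and the object name."""
    *folders, name = key.split("/")
    return folders, name

def keys_under(bucket, prefix):
    """Yield every key under a slash-delimited prefix (needs AWS credentials)."""
    import boto3  # deferred so split_key stays usable without the SDK
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

# keys_under("rs_example_bucket", "tests/") lists everything the queue
# manager has uploaded into the "tests" pseudo-directory.
```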

Testing it Out

Everything is configured and you know how to submit jobs. So it’s time to try it out and see how it goes. When you start up RealityServer with the configuration we made earlier you’ll notice a new log message buried in the start up messages.

QUEUE  main info : [rs_sqs_queue] Queue configured for listening. Waiting for job.

Check that you are seeing this; it confirms that RealityServer is listening to that queue for new jobs and will run them when they appear. Now we’ll submit the job we defined earlier, using cURL to send our JSON-RPC request (or your other favourite HTTP tool, such as Postman).

curl -X POST -H "Content-Type: application/json" -d '{
    "jsonrpc": "2.0", "method": "queue_manager_submit_job", "params":
    {
        "command": "get_version",
        "parameters": {},
        "queue_name": "rs_sqs_queue",
        "tasks": [
            {
                "name": "s3_upload",
                "config": {
                    "bucket": "rs_s3_bucket",
                    "key": "tests/version-${message_id}.${mime_ext}",
                    "acl": "public-read"
                }
            }
        ]
    },
    "id": 1
}' "http://localhost:8080/"

Of course, substitute localhost with your server name if you’re not testing locally. You should get back a response from the server that looks something like this.

{"id":1,"jsonrpc":"2.0","result":"a2ce13a7-a666-407a-b9cf-a8b576717bf1"}

The result is the message id we mentioned earlier. If the job could not be submitted for some reason, the command will return an error instead. Note that just because a job was submitted doesn’t mean it will succeed; we’ll cover that topic a little later. If you were watching the logs on your RealityServer when you submitted the job, you would see output similar to this.

QUEUE  main info : [rs_sqs_queue a2ce13a7-a666-407a-b9cf-a8b576717bf1] Job successfully queued.
QUEUE  main info : [rs_sqs_queue a2ce13a7-a666-407a-b9cf-a8b576717bf1] Received job from queue.
QUEUE  main info : [rs_sqs_queue a2ce13a7-a666-407a-b9cf-a8b576717bf1] Executing command: get_version.
QUEUE  main info : [rs_sqs_queue a2ce13a7-a666-407a-b9cf-a8b576717bf1] Executed command: get_version.
QUEUE  main info : [rs_sqs_queue a2ce13a7-a666-407a-b9cf-a8b576717bf1] Executing task: s3_upload.
QUEUE  main info : [rs_sqs_queue a2ce13a7-a666-407a-b9cf-a8b576717bf1] Executed task: s3_upload.
QUEUE  main info : [rs_sqs_queue a2ce13a7-a666-407a-b9cf-a8b576717bf1] Job processed, removing from queue.

If you were queuing jobs on one RealityServer and they were running on another, you’d obviously see a portion of these messages on each server. Assuming you get output similar to the above, you should find that RealityServer has deposited a new object in your S3 bucket.
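
Since the job runs asynchronously, client code often wants to wait for the result object to appear. Here is a simple polling sketch with boto3; the retry schedule is an arbitrary example:

```python
import time

def backoff_delays(initial=1.0, factor=2.0, tries=5):
    """Delays between polls: 1, 2, 4, ... seconds."""
    return [initial * factor ** i for i in range(tries)]

def wait_for_object(bucket, key):
    """Poll S3 until the job's result object exists, or give up."""
    import boto3  # deferred so backoff_delays stays usable without the SDK
    import botocore.exceptions
    s3 = boto3.client("s3")
    for delay in backoff_delays():
        try:
            s3.head_object(Bucket=bucket, Key=key)  # raises ClientError if absent
            return True
        except botocore.exceptions.ClientError:
            time.sleep(delay)
    return False
```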

For buckets where you allow public access and use the public-read option for the ACL, you can simply navigate to the URL to look at the object. For the above, the URL would look like this.

http://rs_example_bucket.s3.amazonaws.com/tests/version-a2ce13a7-a666-407a-b9cf-a8b576717bf1.json

Notice the message id and extension have been replaced. Since the get_version command does not return a binary (like a rendering command might) the response is a JSON file. The content corresponds to the result property of the JSON-RPC response. For this test it will look something like this.

"5.3, build 2593.256, 08 Oct 2019, linux-x86-64"

Assuming you have something like this in your bucket then you’ve successfully configured and tested the queue manager.

More Complex Example

Now that everything is functioning correctly, how about something a bit more complex? Assuming we have written a custom V8 command called rs_fancy_render, let’s call that (we’ll just use cURL again).

curl -X POST -H "Content-Type: application/json" -d '{
    "jsonrpc": "2.0", "method": "queue_manager_submit_job", "params":
    {
        "command": "rs_fancy_render",
        "parameters": {
            "scene_file": "scenes/meyemII.mi"
        },
        "queue_name": "rs_sqs_queue",
        "tasks": [
            {
                "name": "s3_upload",
                "config": {
                    "bucket": "rs_s3_bucket",
                    "key": "renders/meyemII-${message_id}.${mime_ext}",
                    "acl": "public-read"
                }
            }
        ]
    },
    "id": 1
}' "http://localhost:8080/"

This is very similar to our last example, but we have changed the command, added a parameter for it and changed the key for our bucket. If we submit this, the process will work as before, with one difference: since our rs_fancy_render command returns binary data, it will upload an actual image to S3. So the URL would now look more like this.

http://rs_example_bucket.s3.amazonaws.com/renders/meyemII-877b4510-cae2-40b5-bb3e-c28ee82c68d5.png

You can probably already start to see the power of this. If you are using a public-read ACL, the above would be readable as soon as it has been uploaded. So you can easily build a fleet of servers that take in jobs and fill up your S3 buckets with results, without any dedicated management servers or custom code at all. This makes scaling very straightforward.

Scaling Out

One of the biggest advantages of the queue manager system is that you can potentially scale up and down from 0 to any number of servers depending on your needs, dynamically in response to load. This can be built without custom development using services such as AWS Auto Scaling, AWS EC2 and AWS CloudWatch. Amazon actually have an article explaining just this topic.

With some careful setup it is also possible to use AWS EC2 Spot Instances with the queue manager. This works well because if an instance is killed off in the middle of a job, the job will time out and be placed back on the queue. It will then be picked up by another server and rendered. Since spot instances can be terminated without notice, this behaviour is critical to making them work, and managing it without the queue manager requires a lot of custom development. Where you need to be careful is in ensuring that jobs always, eventually, have somewhere to run successfully, even if delayed; otherwise they will end up in the dead letter queue.

It is important to note that SQS and S3 can be used regardless of where you are actually running your GPU servers. So even if you are self-hosting your GPU server resources, you can still take advantage of the queue manager using SQS and S3.

Handling Errors

One of the good reasons to use the queue_manager_submit_job command instead of manually inserting messages into the queue is that it will tell you if a job could not be submitted, and why. While you find out about submission failures immediately, what about when the actual command from the job throws an error, or a task which runs after the command fails?

Once a job is placed in the queue you no longer have direct access to any errors that are thrown. If the command or a task fails for some reason, the job is returned to the queue in an error state. Assuming you have a Dead Letter Queue set up (which you should), the job will be retried (possibly on a different RealityServer) up to the maximum number of times specified in your Redrive Policy. Once this number is exceeded, the job is deleted from the queue and moved to the dead letter queue.
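
The Redrive Policy can also be configured programmatically. Here is a hedged boto3 sketch; the ARN, queue URL and retry count are placeholders:

```python
import json

def redrive_policy(dlq_arn, max_receives=3):
    """RedrivePolicy value: where failed jobs go, and after how many tries."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receives),
    })

def attach_dlq(queue_url, dlq_arn):
    """Apply the redrive policy to an existing queue (needs AWS credentials)."""
    import boto3  # deferred so redrive_policy stays usable without the SDK
    sqs = boto3.client("sqs")
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"RedrivePolicy": redrive_policy(dlq_arn)})
```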

It is important to monitor the dead letter queue (which is just another SQS queue). If you are seeing a large number of jobs ending up here then it would point to an issue with your service that needs correction. You can use services like CloudWatch to help with this. You’ll usually need to combine this with examining RealityServer log output to find the root cause of the problem.
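
A minimal boto3 check of dead letter queue depth, suitable for a cron job or monitoring script; the alert threshold here is arbitrary:

```python
def dlq_depth(sqs, dlq_url):
    """Approximate number of jobs sitting in the dead letter queue."""
    response = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["ApproximateNumberOfMessages"])
    return int(response["Attributes"]["ApproximateNumberOfMessages"])

def needs_attention(depth, threshold=10):
    """Flag the queue once failures accumulate past a chosen threshold."""
    return depth >= threshold

# Pass a boto3.client("sqs") instance and your dead letter queue URL to
# dlq_depth; alert (or dig into the RealityServer logs) when
# needs_attention returns True.
```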

Go Out and Start Queuing

The latest RealityServer release contains the queue manager functionality so you can get started with it right now. We covered a lot of ground here but when you boil it down to the configuration and job submission it’s actually really simple to setup. Of course if you have any difficulty getting going with queuing reach out to us.
