It sounds like such a simple question and it is asked by almost every new customer we engage. In reality, answering it can get complex quickly; however, there are a few basic ways to get some quick estimates. What we are really talking about here is Capacity Planning. As a more general reference on planning for scalable websites you could take a look at many of the books out there; one now older but still great reference is Cal Henderson’s Building Scalable Websites.
Before setting out we should clarify that at migenius we are typically working with websites and applications with very different compute requirements to your average website. Normally servers are used for serving pages, database operations, perhaps running search engines and other tasks that would occupy only a small portion of any server’s resources when servicing a single user’s requests.
In stark contrast, the portion of a website using our technology is usually associated with Photorealistic Rendering, an extremely complex and compute-intensive operation. To get the speed needed migenius utilises NVIDIA Iray and servers with NVIDIA GPU hardware, making these deployments quite atypical in the world of web application development.
Our technology always targets the maximum performance possible for any given task, even when servicing only a single user. As a result it will quickly soak up any available resources it can in pursuing this goal. This means that most of the time we fully utilise any server which is running our software. Serving a webpage or doing database operations on the other hand usually only utilises a small fraction of the available server resources (allowing many users to use the same server).
Because of these differences we will outline a process for making an estimate of the required servers which we use ourselves when working with customers who are adopting our RealityServer technology. This process should be applicable to any application which also has a heavy compute requirement.
So, here is a fairly straightforward process in six steps to get an estimate of the number of servers you might need to support a compute-intensive (on the server-side) web application.
While you might choose to completely reconfigure your hardware selection later on in the process, you need to get a baseline performance on a particular, known configuration. If you have relative performance data for other hardware configurations you can then estimate how using different hardware will affect your application.
You should pick a hardware configuration that at least gets you your desired performance for a single user. This hardware configuration could be a single server, it could be a cluster, it doesn’t really matter as long as it’s a known configuration and it can get you your basic performance.
Since we are focused here on compute-intensive applications, the amount of computation power needed typically depends entirely on the dataset chosen. There is no point choosing a simple dataset if your application will be processing something much more complex; select something that is representative of what you expect to see when you deploy.
If your datasets vary a lot in complexity, you might want to select multiple datasets. When you don’t have a dataset to use you might need to create a synthetic dataset which approximates something of a similar complexity to the real dataset. Whatever you do, don’t use something trivial since it will only mean you greatly underestimate your resource requirements.
To get a realistic estimate of the required hardware, you really need to think about what the minimum required performance you want to see will be. This might not be the average or typical performance but it should represent the minimum level at which you consider your user is still getting an acceptable experience.
For our example, let’s say that we measure performance in thousands of operations per second (performance can be anything you can define and measure), so higher numbers are better in this case. We are not going to say what these operations might be because we want to keep it generic, and it really doesn’t matter. Now, for argument’s sake we will say that for our application we must have at least the following performance:
10.00 thousand operations per second
Below this threshold we will assume the user experience suffers to the point of being unacceptable.
Now this part can be tricky, particularly since it really depends on whether you have any information from an existing application or website you can use, or if you need to try and make an educated guess based on what you know about your application. This number, however, is also one of the most important, so you should do whatever you can to make a good estimate.
If you do have an existing website with a lot of traffic, your web analytics can help a lot in identifying usage patterns that can lead you to an accurate estimate. Usually your compute-intensive application will only be a small portion of what is done on your website, so you should be careful not to just look at things like monthly unique visitors, since this will lead to greatly overestimating the traffic that would reach your application.
You will also need to decide at this stage whether to plan for the peak or the average. If you want to plan for the peak (maximum simultaneous users) then you may end up with a lot of unused hardware capacity during non-peak times. The flip side is that if you plan for the average, peaks will generally overwhelm your resources and some users are going to miss out.
If you are deploying a system with the ability to scale the resource allocation up and down, you can potentially avoid being caught out without enough resources if you are able to detect when the peaks are coming before they arrive. Auto-scaling is a topic for another post; however, it is well-supported by most cloud providers.
For our discussion here let’s assume the following maximum number of simultaneous users that need to be supported.
25 simultaneous users
Keep in mind these are users that are actively using the compute-intensive portion of the application at exactly the same time. This will typically be a much smaller number than the number of overall visitors to your site (unless the compute-intensive application is the entire reason for people to visit your site).
Once you have your hardware, dataset, performance level and simultaneous users, you need to do some test runs and measure the performance and utilisation. Here is a typical sequence you might go through.
At each step you should review the performance achieved and stop at the point where it goes below your defined Minimum Acceptable Performance. Then take the count of simultaneous users from the previous run and that will be the number of users your base hardware configuration can support. We’ll call that Users per Server.
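As a rough illustration of that stopping rule, here is a minimal Python sketch. The function name `users_per_server` and the measurement values are made up for illustration and are not measurements from any real test; the only thing taken from the text is the rule itself: walk through the runs in order of increasing user count and keep the last count that still met the Minimum Acceptable Performance.

```python
def users_per_server(measurements, min_acceptable):
    """Return the largest simultaneous-user count whose measured per-user
    performance still meets the minimum acceptable performance.

    measurements maps a simultaneous-user count to the per-user
    performance observed in that test run (higher is better).
    """
    supported = 0
    for users, performance in sorted(measurements.items()):
        if performance < min_acceptable:
            break  # first run below the threshold: stop here
        supported = users  # this run was still acceptable
    return supported


# Hypothetical test runs: simultaneous users -> per-user performance
# (thousands of operations per second). Illustrative numbers only.
example_runs = {1: 22.0, 2: 11.5, 3: 7.4, 4: 5.6}

print(users_per_server(example_runs, 10.00))  # prints 2
```

A result of 0 would mean that even a single user falls below the threshold, in which case the base hardware configuration needs to be revisited before going any further.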
So now we have all of the pieces we need to calculate how many servers we will need to provision. For our example we will say we obtained the following performance measurements from Step 5:
With our Minimum Acceptable Performance of 10.00 we can see that our base hardware configuration can support no more than 2 users simultaneously, so Users per Server in this case would be 2. If we were to try and run 3 users then the performance would drop below the level we can tolerate. Here is the simple formula for calculating the number of servers.
Simultaneous Users / Users per Server = Servers Required
Plugging in our example numbers we get.
25 simultaneous users / 2 users per server = 12.5, rounded up to 13 servers
Naturally you can’t have half a server so we always round up when talking about server counts.
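The same calculation can be written as a few lines of Python. This is only a sketch of the formula above; the function name `servers_required` is our own choice for illustration, and `math.ceil` does the rounding up.

```python
import math


def servers_required(simultaneous_users, users_per_server):
    """Servers needed to support the given number of simultaneous users,
    always rounded up to a whole server."""
    if users_per_server < 1:
        raise ValueError("the base configuration must support at least one user")
    return math.ceil(simultaneous_users / users_per_server)


print(servers_required(25, 2))  # prints 13
```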
We glossed over the estimation of simultaneous users fairly quickly, but this really is a complex question and something you should think about carefully before making your estimates. When we ask customers for this data to help them with their planning, the first response is often to provide the monthly unique visitor count for the site, which is a popular metric.
These numbers are typically very large; even for a moderately successful B2C site the figure is likely to be on the order of 40,000, and for large brands it’s not uncommon to see millions of unique visitors each month. This metric is a good starting point if it’s all you have; however, it would be better to have information on how many of those users get past the first page and start interacting with your site.
What you need to do is really drill down on how many of the users that come to your site are likely to get far enough into the site to interact with the compute-intensive part of your application. For a custom apparel site, for example, the customer usually needs to browse categories and then available products to make a selection they want to customise, and this process alone will usually filter out a huge number of those initial users.
Once you have an idea how many users will actually get to the compute intensive bit, you can look at how well distributed over the month they are and try to get a feel for the peaks to get a count for maximum simultaneous users. As you can probably already tell, there is some art to getting to this number; you can’t get too scientific unless you are analysing an application that is already running.
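To make that drill-down a little more concrete, here is one possible back-of-the-envelope sketch in Python. Everything in it apart from the 40,000 monthly visitors figure is an assumption chosen purely for illustration: the fraction of visitors reaching the compute-intensive part, the session length, the peak-to-average factor and the 30-day month. It estimates the average number of concurrent sessions as total session minutes divided by the minutes in a month, then scales that by the peak factor.

```python
import math

MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def estimate_peak_simultaneous_users(monthly_visitors, reach_fraction,
                                     session_minutes, peak_factor):
    """Rough estimate of peak simultaneous users of the compute-intensive
    part of a site. All inputs other than monthly_visitors are guesses
    you should replace with data from your own analytics."""
    sessions = monthly_visitors * reach_fraction
    average_concurrent = sessions * session_minutes / MINUTES_PER_MONTH
    return math.ceil(average_concurrent * peak_factor)


# Hypothetical inputs: 10% of visitors reach the application, each spends
# 10 minutes in it, and peaks are 10 times the monthly average.
peak_users = estimate_peak_simultaneous_users(40_000, 0.10, 10, 10)
print(peak_users)
```

The peak factor is the least certain input here, which is why real traffic data from a running application is so much more valuable than any estimate of this kind.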
By running your tests on different base hardware configurations with known costs, you can start to get a picture for the value for money each of these configurations offers in getting to your needed performance. We will switch to some RealityServer specific examples here but, again, this can be easily adapted for other applications.
We will use GPU instances offered on Amazon EC2 as an example. Let’s compare the following instance types.
The cg1.4xlarge instance has 2 x NVIDIA Tesla M2050 GPUs while the g2.2xlarge instance has 1 x NVIDIA GeForce GRID K520 GPU.
So one is faster but more expensive and one is slower but cheaper, which is pretty much what you would expect. For our example let’s say we need a minimum performance of 0.80, so it is clear that the g2.2xlarge machine will only be able to support a single user. We might then run our tests with multiple users on the cg1.4xlarge instance and get the following:
So, we can support 2 users on that server. However, let’s now look at the costs:
Or, normalising for a single user:
So, for this minimum acceptable performance, the g2.2xlarge instance type is actually more cost effective, which shows that it pays to do this exercise. As you can see, it’s important to look at performance, cost and how many users can be supported.
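The normalisation is simply the hourly cost of a server divided by the number of users it can support at the minimum acceptable performance. A small Python sketch of that comparison follows; the hourly prices in it are placeholder values we have assumed for illustration, not quoted Amazon EC2 prices, while the user counts (2 and 1) are the ones from the example above.

```python
def cost_per_user(hourly_cost, users_per_server):
    """Hourly cost of supporting one simultaneous user on a server."""
    return hourly_cost / users_per_server


# instance type -> (assumed hourly cost in USD, users supported).
# The costs are placeholders; substitute current pricing for your region.
instances = {
    "cg1.4xlarge": (2.10, 2),
    "g2.2xlarge": (0.65, 1),
}

per_user = {name: cost_per_user(cost, users)
            for name, (cost, users) in instances.items()}
cheapest = min(per_user, key=per_user.get)

for name, cost in per_user.items():
    print(f"{name}: {cost:.2f} per user per hour")
print("most cost effective:", cheapest)
```

With these placeholder prices the faster instance still works out more expensive per user, which matches the conclusion above, but the ranking can easily flip with different pricing or a different minimum acceptable performance.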
When deploying a RealityServer application, the computationally intensive task which is being performed is 3D rendering. In RealityServer we offer three different rendering modes.
Each of these behaves very differently on the server in terms of utilisation and therefore capacity. Iray Photoreal for example will eat up any hardware you give it and utilise it entirely, including multiple GPUs, so generally doubling the number of users gives about half the performance of each user.
Iray Interactive on the other hand tends not to completely utilise all hardware and doesn’t effectively use multiple GPUs or clustering in most cases, so a machine with multiple GPUs may allow multiple users without degrading performance until the number of users exceeds the number of GPUs; even then, Iray Interactive will tend to degrade in a sub-linear fashion, so two users will generally get a bit better than half the performance each.
Iray Realtime is OpenGL based and typically the overheads of the network transmission dominate over rendering time, so it is often possible to support many users on even a single GPU. It doesn’t effectively use multiple GPUs so, as with Iray Interactive, performance will only degrade after the number of users exceeds the number of GPUs; with Iray Realtime, however, performance will degrade more slowly even then.
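These behaviours can be turned into a very rough planning model before you have real measurements. The Python sketch below is a toy model only and is not how RealityServer or Iray schedule work internally: `photoreal_per_user` encodes the "double the users, halve the performance" rule of thumb, while `gpu_bound_per_user` encodes "no degradation until users exceed GPUs, then sub-linear degradation". The `exponent` parameter is a value we have invented to express sub-linear behaviour and would need to be fitted to your own test runs.

```python
def photoreal_per_user(single_user_perf, users):
    """Per-user performance when the renderer uses all hardware and the
    load is shared evenly: N users each get roughly 1/N."""
    return single_user_perf / users


def gpu_bound_per_user(single_user_perf, users, gpus, exponent=0.9):
    """Per-user performance for a mode that uses one GPU per user.

    Performance holds steady until users exceed GPUs, then degrades
    sub-linearly. exponent < 1 is an assumed tuning value, not a
    documented property of the renderer.
    """
    if users <= gpus:
        return single_user_perf
    return single_user_perf / (users / gpus) ** exponent


# Two users on a two-GPU server, single-user performance of 20.0.
print(photoreal_per_user(20.0, 2))     # shared evenly between users
print(gpu_bound_per_user(20.0, 2, 2))  # one GPU each, no degradation
print(gpu_bound_per_user(20.0, 2, 1))  # a bit better than half each
```

A model like this is only useful for a first guess at Users per Server; the measured test runs described earlier should always take precedence.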
For any reasonably sized application with a decent user base you will likely have multiple servers. We haven’t yet talked about how you will actually spread the load around those multiple servers. This topic is too detailed to cover here and we will look deeper into this separately and specifically for RealityServer which has specific features for interacting with load balancers.
Just keep in mind that in addition to your servers managing the compute load, you will also need to think about having something that sits in front of these to manage traffic and who uses what resources. This is typically done with a load balancer such as HAProxy or, if you are using AWS, their [Elastic Load Balancer](http://aws.amazon.com/elasticloadbalancing/) offering. Usually you will need at least 1-2 extra servers for this purpose, but fortunately they don’t need to be too powerful unless you have a very large number of backend servers to manage.
Hand in hand with Load Balancing is dealing with Failover and Redundancy. Again, this is a bigger topic than we can cover here, but in a robust application deployment you will want a minimum of two front end servers or load balancers in case one fails. You will also want to think about how failure of compute nodes gets handled and communicated to the load balancers. You want to avoid having a Single Point of Failure in your deployed application.
The process just described will get you a very basic estimate for an application with one particular way of operating. If your application has many different functions with different performance characteristics you may have to run multiple tests and combine the results to reach your server count.
Of course, the estimate is only as good as the quality of your input data, in particular your estimate of the Maximum Simultaneous Users. You should continuously re-evaluate your estimates as more information becomes available, and this shouldn’t stop once your site is live; at that point you can at least replace your estimates with real data from your running site.
Fortunately these days it is a lot easier than it used to be to add and remove capacity as you need it, thanks to the proliferation of IaaS (Infrastructure as a Service) cloud providers such as Amazon EC2, Nimbix, Penguin Computing, Peer1 and Rackspace.