Hosting game servers at scale using Azure Container Instances, using Azure Functions and Event Grid. Demo with OpenArena game server!
This project allows you to manage Docker containers running on Azure Container Instances. Suppose that you want to manage a series of running Docker containers. These containers may be stateful, so classic scaling methods (via Load Balancers etc.) would not work. A classic example is multiplayer game servers, where each server has its own connections to game clients, its own state etc. Another example would be batch-style projects, where each instance would have to deal with a separate set of data. For these kinds of purposes, you would need a set of Docker containers that are created on demand and, in order to save costs, deleted when their job is done and they are no longer needed.
The project contains some Functions/webhooks that can be called to create, delete and get logs from Azure Container Instances (called ACICreate, ACIDelete and ACIDetails respectively). There is a Function, called ACISetSessions, which can be used to set/report running/active sessions for each container. These sessions could be game server sessions or just 'remaining work to do'. When we create a new container, it takes some time for it to be created. When it's done and the container is running successfully, our project is notified via an Event Grid message. This message is posted to the ACIMonitor Function, whose sole purpose is to listen to these messages and act appropriately. There is also a Function (ACIList) that retrieves a list of the running containers, the number of their active jobs/sessions as well as their Public IPs. This can be used to see the load of the running containers.
There is also a Function (ACISetState) that enables the caller to set the state of a container. This can be used to 'smoothly delete' a running container. Imagine this: at some point in time, we might want to delete a container (probably because the existing ones can handle the incoming load). However, we do not want to disrupt existing jobs/sessions running on this particular container, so we call this Function to set its state to 'MarkedForDeletion'. Moreover, there is another Function (ACIGC) that is called at regular time intervals and whose job is to delete containers that are 'MarkedForDeletion' and have no running jobs/sessions on them. To delete them, it calls the ACIDelete Function.
Finally, we suppose that there is an external service that uses our Functions to manage running Docker containers and schedule sessions on them.
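As a rough illustration of how such an external service might schedule a new session, the sketch below picks the running container with the fewest active sessions from a list like the one ACIList returns. The field names (name, state, active_sessions, public_ip) and the max_sessions limit are assumptions made for this example, not the actual response schema of the Function.

```python
from typing import Optional


def pick_server(servers: list[dict], max_sessions: int) -> Optional[dict]:
    """Return the least-loaded running server that still has capacity, or None.

    Each entry is assumed (hypothetically) to look like:
    {"name": ..., "state": ..., "active_sessions": ..., "public_ip": ...}
    """
    candidates = [
        s for s in servers
        if s["state"] == "Running" and s["active_sessions"] < max_sessions
    ]
    if not candidates:
        # No capacity left: the caller would probably ask for a new container.
        return None
    return min(candidates, key=lambda s: s["active_sessions"])


if __name__ == "__main__":
    demo = [
        {"name": "a", "state": "Running", "active_sessions": 7, "public_ip": "10.0.0.1"},
        {"name": "b", "state": "Running", "active_sessions": 2, "public_ip": "10.0.0.2"},
        {"name": "c", "state": "MarkedForDeletion", "active_sessions": 0, "public_ip": "10.0.0.3"},
    ]
    print(pick_server(demo, max_sessions=10))
```

Containers that are 'MarkedForDeletion' are skipped on purpose, so that they can drain and be removed.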
Click the following button to deploy the project to your Azure subscription:
This operation will trigger a template deployment of the deploy.json ARM template file to your Azure subscription, which will create the necessary Azure resources as well as pull the source code from this repository.
You need to specify the following information in order to deploy the project:
The Functions are deployed on a Free App Service Plan; you may need to scale it up for increased performance.
The project uses Managed Service Identity and its integration with Azure Functions to authenticate to the Azure Resource Manager (ARM) API in order to create/delete/modify the Azure Container Instances needed. The deployment script automatically creates an app identity for the Function App; however, you need to give this identity permissions to the Resource Group that will host your Container Instances. To do that:
Moreover, as soon as the deployment completes, you need to manually add the Event Subscription webhook for the ACIMonitor Function using the instructions here. Just make sure that you select the correct Resource Group to monitor for events (i.e. the Azure Resource Group where your containers will be created). This will make Event Grid send a message to the ACIMonitor Function as soon as there is a resource modification in the specified Resource Group. As soon as this completes, your deployment is ready. Optionally, as soon as you get the URL of the ACIMonitor Function, you can use this ARM template to deploy the Event Grid subscription.
When you deploy the Event Grid subscription using the Portal, these are the values you need to fill in:
Last but not least, with the new v2 runtime of Azure Functions, the EventGrid binding extension may need manual registration. Under normal circumstances, the extension will be installed automatically (as it's registered in the extensions.csproj file), but if this does not happen, you can check the following articles on how to do it manually:
We've created a couple of demos so that you can test the project; check the detailed documentation in the DEMOS.md file.
This project allows you to manage Azure Container Instances using Azure Functions and Event Grid. All operations deal with Container Groups, which are the top-level resource in Azure Container Instances. Each Container Group can have one or more containers, a public IP etc. Most Functions are HTTP-triggered unless otherwise noted. Moreover, all HTTP-triggered Functions are protected by 'authorization keys' apart from the ACIList Function, which needs to be anonymous so it can be called by the 'list servers' HTML page.
As mentioned before, the HTTP-triggered Functions are supposed to be called by an external service (for a game, this would potentially be the matchmaking service). The details of all running container groups/instances are saved in an Azure Table Storage table that is created during deployment. For each container, there is a row that holds data regarding its name (specifically, the container group name), the Resource Group it belongs to, its Public IP Address, the Azure datacenter location it was created on, its CPU/RAM resources, its current active sessions and its state.
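To make the shape of a table row more concrete, here is a minimal sketch of the data described above as a Python dataclass. The attribute names and types are assumptions for illustration only; the actual column names in the Table Storage table may differ.

```python
from dataclasses import dataclass


@dataclass
class ContainerGroupRow:
    """One row per container group (attribute names are hypothetical)."""
    name: str             # the container group name
    resource_group: str   # the Resource Group it belongs to
    public_ip: str        # Public IP Address
    location: str         # Azure datacenter location
    cpu: float            # CPU resources
    memory_gb: float      # RAM resources
    active_sessions: int  # current active sessions
    state: str            # e.g. "Creating", "Running", "MarkedForDeletion"


def total_active_sessions(rows: list[ContainerGroupRow]) -> int:
    """Sum the active sessions over all rows that are in the Running state."""
    return sum(r.active_sessions for r in rows if r.state == "Running")
```

A helper such as total_active_sessions is the kind of aggregate an external service could compute to estimate the current system load.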
In this table, Azure Container Groups can hold one of the below states:
- MarkedForDeletion: the Container Group is marked so that it will be deleted when a) there are no more active sessions and b) the ACIGC Function runs
Moreover, if you navigate to the root of the deployment using a web browser (e.g. visit https://your_function_name.azurewebsites.net) you will see a 'list servers' page that displays details about your running servers (Public IPs, ActiveSessions, datacenter Location etc.). The HTML page lives in the ACIList Function and is served via a ?html=something input in the query string. We're using Azure Functions Proxies to have the root path (/) of the application point to the HTML file using the special query string. Check the proxies.json file for details.
A typical flow of the project scenario goes like this:
1. The external service calls ACICreate, so a new Container Group is created and is set to Creating state in the table.
2. When Event Grid posts a message to the ACIMonitor function, this means that the Container Group is ready. The ACIMonitor function inserts its public IP into Table Storage and sets its state to Running.
3. The external service can call ACIList to get info about Container Groups in Running state. The service can use this information to determine current system load and schedule new sessions accordingly.
4. The ACIDetails Function can be called to get logs/debug a running Container or get details about the Container Group.
5. ACISetSessions is called to set the running sessions count on Table Storage.
6. ACISetState is called to set the Container Group's state as MarkedForDeletion when the Container Group is no longer needed.
7. ACIGC (GC: Garbage Collector) will delete unwanted Container Groups (i.e. Container Groups that have 0 active/running sessions and are MarkedForDeletion). The deletion will happen via the ACIDelete Function.
Important: We take it for granted that the server application will contain code to get access to its state. This way, if its current state is 'MarkedForDeletion', there will be no other sessions on this server when the current workload finishes (e.g. if we're running a multiplayer game server, players will disconnect and return to the matchmaking lobby after the current game completes). This way, the Container Instance can safely be removed by the ACIGC Function.
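The garbage collection rule in the flow above (delete only Container Groups that are MarkedForDeletion and have 0 active sessions) can be sketched as a simple filter. This is only an illustration of the rule, not the ACIGC implementation; the dictionary keys used here are assumed names.

```python
def groups_to_delete(rows: list[dict]) -> list[str]:
    """Return the names of container groups that are safe to delete.

    A group qualifies only if it is MarkedForDeletion AND has no
    active sessions left. Keys "name", "state" and "active_sessions"
    are hypothetical.
    """
    return [
        row["name"]
        for row in rows
        if row["state"] == "MarkedForDeletion" and row["active_sessions"] == 0
    ]


if __name__ == "__main__":
    table = [
        {"name": "g1", "state": "Running", "active_sessions": 0},
        {"name": "g2", "state": "MarkedForDeletion", "active_sessions": 3},
        {"name": "g3", "state": "MarkedForDeletion", "active_sessions": 0},
    ]
    print(groups_to_delete(table))
```

Note that g2 is kept even though it is marked, because it still has sessions running; it would be picked up on a later run once they finish.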
You will see that we have an ACIAutoScaler Function, disabled by default (the value is set in the ACIAutoScaler\function.json file). This Function attempts to provide an autoscaling mechanism for this project. It is timer-triggered and works according to the following simple logic:
- It reads the ACIAutoScaler\config.json file for values regarding whether scale in/out is allowed and the maximum sessions per server.
- If a scale out is needed, the ACICreate Function is called to add another Container Group.
- If a scale in is needed, the ACISetState Function is called to set the container with the fewest active sessions as MarkedForDeletion (we take into account that our app/game is clever enough to not schedule any more sessions on this container).
For all this to work, the user (optionally) has to manually fill in and/or modify values for the following environment variables:
- CONTAINER_GROUP_TEMPLATE: the ARM template for the container group that will be deployed. You can use the contents of this file as a starting point (you can modify this file and use it for the OpenArena demo).
- AUTOSCALER_MINIMUM_INSTANCES (default 1): the minimum number of instances that should exist in our deployment
- AUTOSCALER_MAXIMUM_INSTANCES (default 10): the maximum number of instances
- AUTOSCALER_SCALE_OUT_THRESHOLD (default 0.8): the load threshold (as a fraction of total capacity) that, when surpassed, triggers a scale out. For example, if our deployment has 2 servers/instances and each one of them can hold 10 sessions, a scale out operation will take place when there are 17 sessions (17/20 = 0.85, which is above 0.8)
- AUTOSCALER_SCALE_IN_THRESHOLD (default 0.6): same as before, but this time for scale in
- AUTOSCALER_COOLDOWN_IN_MINUTES (default 10): the number of minutes for a 'cooldown', i.e. the time that should pass after a scale in/out operation till the next one
This autoscaling is considered pretty basic but can be used as a starting point for you to create your own algorithm and/or establish your own rules.
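To show how these settings could interact, here is a minimal sketch of a scaling decision. It is an assumption about how the thresholds, instance limits and cooldown might be combined, not the actual ACIAutoScaler code; the function and parameter names are hypothetical, and treating "below the minimum number of instances" as a scale out is also an assumption.

```python
from typing import Optional


def autoscale_decision(
    instances: int,
    total_sessions: int,
    max_sessions_per_server: int,
    minimum_instances: int = 1,
    maximum_instances: int = 10,
    scale_out_threshold: float = 0.8,
    scale_in_threshold: float = 0.6,
    cooldown_minutes: float = 10,
    minutes_since_last_scale: Optional[float] = None,
) -> str:
    """Return "scale_out", "scale_in" or "none"."""
    # Respect the cooldown after the previous scale in/out operation.
    if minutes_since_last_scale is not None and minutes_since_last_scale < cooldown_minutes:
        return "none"
    # Assumption: always grow back to the configured minimum.
    if instances < minimum_instances:
        return "scale_out"
    if instances == 0:
        # Nothing is running and nothing is required; also avoids dividing by zero.
        return "none"
    load = total_sessions / (instances * max_sessions_per_server)
    if load > scale_out_threshold and instances < maximum_instances:
        return "scale_out"
    if load < scale_in_threshold and instances > minimum_instances:
        return "scale_in"
    return "none"


if __name__ == "__main__":
    # The example from the text: 2 servers, 10 sessions each, 17 sessions.
    print(autoscale_decision(2, 17, 10))  # scale_out
    print(autoscale_decision(2, 16, 10))  # none (0.8 is not surpassed)
```

In this sketch a "scale_out" result would correspond to a call to ACICreate, and "scale_in" to a call to ACISetState on the container with the fewest active sessions.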
This project was heavily inspired by AzureGameRoomsScaler, a project that deals with a similar issue but uses Azure VMs.
This guides the Kudu engine as to where the source code for the Functions is located, in the GitHub repo. Check here for details.
Indeed, there are 4 ARM files in the project. The first three of them are executed in the following order:
Check here for resource group events and here for subscription-wide events.
As always, Azure documentation is your friend; check here. Don't forget that for running containers, you can see their logs via a call to the ACIDetails Function.
Check here to read some details about the Azure Functions key management API. You can easily retrieve the keys from the Azure Portal by visiting each Function's page.
There is no proper Function testing in this project (yet); however, you can see a 'testing' file at tests\index.js. To run it, you need to set up a tests\.env file with the following variables properly set:
Check this page on Azure Event Grid documentation.
Check here for the ARM Template for Container Groups.
You can check the Azure Resource Manager documentation here.
Check here for instructions on how to deploy images that are hosted on Azure Container Registry.
Of course, check here for the allowed options as well as here for the correct location of the restartPolicy property in the Container Group ARM template.
For both purposes, the best way to do it would be to fork the project on GitHub and work on it on your own repo/copy. Then, you could easily modify it and either manually deploy it or (even better) use Continuous deployment for Azure Functions.
Check here on how to execute a command from within a running container of a container group. Moreover, you can use Azure Monitor to check for additional runtime metrics; check here for more details.