An exploration of integration testing in a serverless environment

One of the difficulties we have had with serverless development is the scope of integration testing. This blog will take you through the exploration we have done around integration testing and what has evolved as our current solution.

Unit tests

I'll start with the unit tests; this part is relatively straightforward. Unit tests (in Mocha or Jest) cover every line of code written in our Lambdas. They can test red and green paths, with mocks set up for anything outside the Lambda's scope.
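As an illustration, here is a minimal, framework-agnostic sketch of that idea; in our suites this logic would live inside a Mocha/Jest `it()` block with its assertion style. The handler and the `saveRecord` dependency are hypothetical, and the DynamoDB write is injected purely so the test can swap in a mock:

```javascript
// Hypothetical Lambda handler. Anything outside the Lambda's scope
// (here, the DynamoDB write) is injected so a unit test can mock it.
const makeHandler = (saveRecord) => async (event) => {
  const trimmed = event.records.map((record) => ({
    ...record,
    name: record.name.trim(),
  }));
  await Promise.all(trimmed.map(saveRecord)); // side effect, mocked in tests
  return trimmed;
};

// Green path: the mock records what "DynamoDB" was asked to save.
const saved = [];
const handler = makeHandler(async (record) => saved.push(record));

handler({ records: [{ name: '  Ada  ' }] }).then((result) => {
  console.log(result[0].name); // 'Ada'
  console.log(saved.length);   // 1
});
```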

The consensus, based on my research, is that the content of the StepFunction should not be covered by unit tests and should instead fall under integration testing.

Integration Tests

Attempt 1: Using the StepFunction output

Our first approach for integration tests was to have the StepFunction running either locally or in a test environment, triggering it directly from our tests using the AWS SDK.

This gave us control over the input of the StepFunction and let us check its output, but it was lacking in a few areas.
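In code, attempt 1 looked roughly like the sketch below. To keep the example self-contained, the AWS client's `.send()` is injected as a plain function; in the real tests this would be an `SFNClient` from `@aws-sdk/client-sfn` sending `StartExecutionCommand` and `DescribeExecutionCommand` (the `action` field and ARNs here are illustrative stand-ins, not the SDK's real shapes):

```javascript
// Start a StepFunction execution and poll until it finishes.
// `send` stands in for an AWS SDK v3 client's .send() so tests can stub it.
const runStateMachine = async (send, stateMachineArn, input) => {
  const { executionArn } = await send({
    action: 'StartExecution', // illustrative stand-in for StartExecutionCommand
    stateMachineArn,
    input: JSON.stringify(input),
  });

  // Describe the execution until it leaves RUNNING, then hand back the
  // final status and output for assertions.
  for (;;) {
    const execution = await send({ action: 'DescribeExecution', executionArn });
    if (execution.status !== 'RUNNING') {
      return { status: execution.status, output: execution.output };
    }
    await new Promise((resolve) => setTimeout(resolve, 500));
  }
};
```

The weakness described below is already visible here: all the test ever sees is the final `output`; intermediate side effects never surface.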

Firstly, we found that this approach lacked the granularity we needed. If we take the initial service example, although we could check the overall output, we could not check any side effects, such as what was being saved in DynamoDB.

Secondly, the SDK for StepFunctions was not friendly for local use. We needed the tests to be runnable locally as well as in a pipeline, but there were differences in whether the StepFunction could be called synchronously locally versus in the cloud.

Finally, and something we only realised later, because we were calling the StepFunction directly, we were not testing the EventBridge rules at all with this approach.

Attempt 2: Using Lambda Logs

Our next attempt was to have our Lambdas output specific log entries at various points, which we then pulled into the testing library, parsed, and ran assertions on.

While this approach allowed us to test with reasonable granularity, it was quite messy. We had code (log outputs) intrinsically linked to specific tests. This made it too easy to accidentally fudge results, and to change the code without a corresponding change to the tests.

It also meant that we had a load of log entries, and a test could pass while the code did not functionally work, e.g.:


exports.handler = async function (event, context) {
  const trimmed = event.records.map((record) => ({
    ...record,
    name: record.name.trim(),
  }));
  console.log('## RETURN OBJ: ' + JSON.stringify(trimmed));

  return event.records; // bug: returns the unprocessed records
}

In the above example, we are mistakenly returning the initial records instead of the processed ones, and although we are logging the processed object, the log does not reflect what the code actually returns.

This was the point we realised that we were not testing the trigger rules we have in the service, because we were not testing via EventBridge.

Attempt 3: Using input and output events

Our third (and surviving) attempt was to use a deployed version of the application.

We knew we wanted to trigger the service through the events and look at the resultant events so that we are testing the service through all event-driven inputs and outputs.

We also knew we wanted to have the testing architecture ephemeral, where we can spin it up and down, so it's alive only when tests are running.

We created a helper library that had the following functionality:

  • The ability to create testing architecture before running tests

    • This is run in our 'before()' or 'beforeAll()' depending on our framework
  • The ability to destroy the testing architecture after the tests

    • This is run in our 'after()' or 'afterAll()' depending on our framework
  • The ability to fire an event

    • This is called in each test to trigger the run
  • The ability to pull events that happened as a result of the initial event

    • This is called after firing the event, which we can then return assertions on
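Stitched together, a test run using those four abilities looks roughly like this. The function names are illustrative stand-ins, not the package's real API:

```javascript
// Lifecycle of one integration test, built from the four helper abilities.
// All four functions here are hypothetical stand-ins for the helper library.
const runIntegrationTest = async (helpers, testEvent) => {
  // before()/beforeAll(): spin up the ephemeral testing architecture.
  const ctx = await helpers.createTestArchitecture();
  try {
    // Fire the event that triggers the run...
    await helpers.fireEvent(ctx, testEvent);
    // ...then pull the events that happened as a result, to assert on.
    return await helpers.pullEvents(ctx);
  } finally {
    // after()/afterAll(): tear the architecture back down, pass or fail.
    await helpers.destroyTestArchitecture(ctx);
  }
};
```

The `try`/`finally` mirrors the `before`/`after` hooks: the architecture is destroyed even when an assertion on the pulled events fails.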

Changes to service architecture

The above chart shows the minor changes to the system architecture; there are no code changes required in the Lambdas, which is a nice advantage of this approach. All we need to do is fire events when something happens that we want to assert. Realistically we want to do this anyway, as other parts of the system may care about this happening and want to react to it.

Each event simply states what happened, along with the return value of the Lambda. We have also added an error event, which is emitted if there is an irrecoverable error in the Lambda.

The testing architecture

To see what was added to the EventBridge, we need to create a few items whose job it is to monitor and hold this information. While there are a few choices, we opted for using an SQS queue that is specifically created for the current testing instance.

We will cover the technical implementation and teardown code in a separate blog, but for this exploration, all you need to know is that we need a variable that carries through to the events. We use a specific source that we can hook an EventBridge rule into, e.g.:

com.integration-test.${UUID}

When we start our tests, we spin up an SQS queue and a rule that picks up any events that contain the above source.

Here is an example of the rule:

{
    "source": [
        "com.integration-test.${UUID}"
    ]
}

Because the UUID is generated at run time, we can be sure that the only events that appear in the SQS queue have come from our specific run.
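Firing the triggering event then just means putting it on the bus with that run-specific source, so the rule above routes everything that follows into our queue (how the UUID carries through to downstream events is part of the implementation covered in the separate blog). A sketch, with `send` standing in for an EventBridge client's `.send()` of a `PutEventsCommand` from `@aws-sdk/client-eventbridge`, and the detail type and bus name purely illustrative:

```javascript
// Build and fire the triggering event, tagged with the run-specific UUID
// source. `send` is injected so tests can stub the EventBridge client.
const fireTestEvent = (send, uuid, detail) =>
  send({
    Entries: [
      {
        Source: `com.integration-test.${uuid}`, // matches the rule above
        DetailType: 'integration-test-trigger', // illustrative
        Detail: JSON.stringify(detail),
        EventBusName: 'default',                // illustrative
      },
    ],
  });
```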

We can then pull the resulting events from the SQS queue using the SDK, e.g.:

const { SQSClient, ReceiveMessageCommand } = require('@aws-sdk/client-sqs');

const sqs = new SQSClient({});

const getMessageFromSQS = async () => {
  const params = {
    QueueUrl: sqsQueueUrl, // the queue created for this test run
    WaitTimeSeconds: 5,    // long poll for up to 5 seconds
  };
  return sqs.send(new ReceiveMessageCommand(params))
    .then((data) => data.Messages)
    .catch((err) => {
      console.log(err);
    });
}

This mechanism long-polls for up to 5 seconds and returns the first event that lands in the SQS queue in that window, so if the service creates multiple events, we will need to call SQS multiple times. Each call will either come back with no messages, meaning no events were caught, or return a single event matching the one created in the StepFunction step.

Please be aware that if you have multiple events, you need to call getMessageFromSQS multiple times, as only one message is ever returned per call.
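A small wrapper makes that repetition explicit: keep calling the single-message fetcher until a long poll comes back empty. This is a sketch, with `getMessage` standing in for the getMessageFromSQS function above:

```javascript
// Drain the queue: collect events until a long poll returns no messages.
// `getMessage` stands in for getMessageFromSQS and resolves to an array
// of messages, or undefined when the poll window passes with nothing.
const drainQueue = async (getMessage) => {
  const events = [];
  for (;;) {
    const messages = await getMessage();
    if (!messages || messages.length === 0) {
      return events; // the 5-second poll came back empty: we are done
    }
    events.push(...messages);
  }
};
```

The returned array can then be parsed (each message `Body` is the event JSON) and asserted on as a whole, rather than one event at a time.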

Summary

We hope that from this article you can see that, with this approach, we can reliably test an event-driven serverless service from event to event (inputs to outputs) without needing permanent architecture deployed alongside our code or testing-specific toggles in our codebase.

If you are interested in trying out our approach, take a look at the following npm package: https://www.npmjs.com/package/@3t-transform/test-n-vac