Schedule data processing in the future

Manvel Ghazaryan
5 min read · Aug 13, 2020

I work for a company in the sports domain. We deal a lot with sports events, teams, players, etc. Part of our system ingests this data from 3rd-party APIs. These ingesters are implemented as Azure Functions, which run under a schedule (every hour, once a day at a specific time, etc.). The ingested data is then sent on for further processing and persistence.

The scheduling part is handled by Logic Apps, which allow us to define robust schedules. Each function is triggered by a Service Bus queue trigger. The trigger message carries details about the data ingestion. For example, it can say "ingest game schedules that start between today and today + 7 days".
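
For illustration, such a trigger could look like the sketch below, using the Python programming model for Azure Functions. The queue name, connection setting, and message shape are all assumptions, not our actual names:

```python
import json

import azure.functions as func

app = func.FunctionApp()

# Hypothetical queue name and connection setting for illustration.
@app.service_bus_queue_trigger(arg_name="msg",
                               queue_name="ingestion-trigger",
                               connection="ServiceBusConnection")
def ingest_game_schedules(msg: func.ServiceBusMessage) -> None:
    # Example payload:
    # {"task": "game_schedules", "from": "2020-08-13", "to": "2020-08-20"}
    request = json.loads(msg.get_body().decode("utf-8"))
    # ... ingest schedules for the requested window and pass them on ...
```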

Below is a visualization of this process:

Architecture diagram

One of our customers requested that we continue ingesting their baseball events for 7 days after each event has finished. Essentially, these are baseball events with a list of participants; the request was to keep ingesting the list of participants for 7 days after an event ends. The customer has an API that serves event participants, e.g.

GET /api/event/{eventid}/participants

Querying the endpoint with an event id (ingested via another API call) returns the list of participants.
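
Calling it from Python might look like this (the base URL is a placeholder):

```python
import requests

CUSTOMER_API = "https://customer.example.com"  # placeholder base URL

def fetch_participants(event_id: str) -> list:
    # GET /api/event/{eventid}/participants
    response = requests.get(f"{CUSTOMER_API}/api/event/{event_id}/participants")
    response.raise_for_status()
    return response.json()
```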

Traditional approach

The first thought was to create a new function, triggered daily by a Logic App. Upon execution, it would ask our internal systems for all events that finished within the last 7 days and, for each event, process the list of participants.

So the steps would be (a rough sketch of this function follows the list):

1. Logic App triggers the function

2. The function executes, asking the internal downstream system for all events that finished within the last 7 days

3. For each returned event id, query the customer API for the event's participants

4. Send the ingested event participants for further processing
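
A sketch of steps 2–4; the endpoints and helper names here are all hypothetical:

```python
from datetime import datetime, timedelta, timezone

import requests

INTERNAL_API = "https://internal.example.com"   # placeholder
CUSTOMER_API = "https://customer.example.com"   # placeholder

def send_for_processing(event_id: str, participants: list) -> None:
    ...  # forward to the downstream pipeline (omitted)

def ingest_finished_events() -> None:
    # Step 2: ask the internal system for events finished in the last 7 days.
    since = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()
    events = requests.get(f"{INTERNAL_API}/api/events",
                          params={"finishedSince": since}).json()
    for event in events:
        # Step 3: query the customer API for the event's participants.
        participants = requests.get(
            f"{CUSTOMER_API}/api/event/{event['id']}/participants").json()
        # Step 4: send the ingested participants for further processing.
        send_for_processing(event["id"], participants)
```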

This is a valid solution; it will get the job done. However, I would prefer a function that does less. Ideally it should get an input, execute, and produce an output. The analogy I have in mind is a Multiply function: it takes 2 numbers as input, multiplies them, and returns the result. The less orchestration a function does, the fewer moving parts it has, and the easier it is to reason about and maintain.

Basically, we’re looking for a solution where function gets executed and input message carries event ids to be processed.

Towards thin function

We already discussed how a Logic App is used to trigger function invocation. We can define a Logic App that runs under a schedule, calls an internal HTTP endpoint to get the list of events to be processed, and sends this data to a Service Bus queue, effectively triggering the function.

To visualize the steps:

Logic App Designer

This achieves the desired state of having an Azure Function that gets an input, executes, and produces an output. Notice that the function itself no longer has to call the internal system to get its input data (the event ids to ingest data for).

If we compare this with the previous solution, the steps would be (a sketch of the thin function follows the list):

1. Logic App triggers the function

2. For each event id in the input, query the customer API for the event's participants

3. Send the ingested event participants for further processing
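
A sketch of such a thin function; the message shape, queue name, and customer URL are assumptions:

```python
import json

import azure.functions as func
import requests

CUSTOMER_API = "https://customer.example.com"  # placeholder

app = func.FunctionApp()

@app.service_bus_queue_trigger(arg_name="msg",
                               queue_name="finished-events",
                               connection="ServiceBusConnection")
def ingest_participants(msg: func.ServiceBusMessage) -> None:
    # The Logic App already resolved which events to process,
    # e.g. {"eventIds": ["123", "456"]}.
    for event_id in json.loads(msg.get_body().decode("utf-8"))["eventIds"]:
        participants = requests.get(
            f"{CUSTOMER_API}/api/event/{event_id}/participants").json()
        # ... send participants on for further processing (omitted) ...
```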

Note that we moved figuring out which event ids should be processed into the Logic App (steps #1 and #2 of the previous solution merged into step #1 here). We're doing the same amount of work, but we've redistributed the responsibilities.

Towards self-contained solution

I was wondering whether it's possible to schedule event processing without relying on other parts of the system. At the end of the day, the source of events is the 3rd-party API. The function ingests data from this API, so it already has the end date of each event. What if we scheduled finished-event processing from the function itself?

Here’s the idea.

Azure Service Bus supports scheduled messages: we can enqueue a message now and have it delivered at some point in the future. We can use this feature to schedule finished-event processing.
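
With the azure-servicebus SDK for Python, scheduling a message looks roughly like this (the connection string and queue name are placeholders):

```python
from datetime import datetime, timedelta, timezone

from azure.servicebus import ServiceBusClient, ServiceBusMessage

with ServiceBusClient.from_connection_string("<connection-string>") as client:
    with client.get_queue_sender("finished-events") as sender:
        message = ServiceBusMessage('{"eventId": "123"}')
        # Enqueue now, but deliver the message only tomorrow.
        tomorrow = datetime.now(timezone.utc) + timedelta(days=1)
        sender.schedule_messages(message, tomorrow)
```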

Here’s some code

This is the function that ingests data from the 3rd-party API. After the events are processed, we schedule finished-event processing.
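
A minimal sketch of that scheduling step, assuming the Python Service Bus SDK and a hypothetical `finished-events` queue:

```python
import json
from datetime import datetime, timedelta

from azure.servicebus import ServiceBusClient, ServiceBusMessage

def schedule_finished_event_processing(client: ServiceBusClient,
                                       events: list) -> None:
    # Called after the events themselves have been ingested and processed.
    with client.get_queue_sender("finished-events") as sender:
        for event in events:
            # Assumes the API returns an ISO-8601, UTC end date.
            end_date = datetime.fromisoformat(event["endDate"])
            # One message per day, for the 7 days after the event ends.
            for day in range(1, 8):
                message = ServiceBusMessage(json.dumps({"eventId": event["id"]}))
                sender.schedule_messages(message, end_date + timedelta(days=day))
```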

Some code is omitted for the sake of brevity, but it illustrates the idea: for each event that was processed, we take its end date and schedule a processing message for each of the 7 days after the end date. E.g., for an event that ends tomorrow, this schedules 7 messages: the first for the day after tomorrow, the second for the day after that, and so on, up to 7 days after the event ends.

Now to the last piece: the Azure Function that processes those scheduled messages.
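
A sketch of that consumer, reusing the same hypothetical queue name and placeholder customer URL as above:

```python
import json

import azure.functions as func
import requests

CUSTOMER_API = "https://customer.example.com"  # placeholder

app = func.FunctionApp()

@app.service_bus_queue_trigger(arg_name="msg",
                               queue_name="finished-events",
                               connection="ServiceBusConnection")
def process_finished_event(msg: func.ServiceBusMessage) -> None:
    # The scheduled message carries everything needed: the event id.
    event_id = json.loads(msg.get_body().decode("utf-8"))["eventId"]
    participants = requests.get(
        f"{CUSTOMER_API}/api/event/{event_id}/participants").json()
    # ... send participants on for further processing (omitted) ...
```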

This function executes once per scheduled message, and the message carries all the data needed for the invocation.

Let's compare this with the previous solution.

First, we're mixing up the responsibility for scheduling. As explained before, Logic Apps are responsible for scheduling. With this change, some scheduling now happens in the function itself, so scheduling is split between the Logic App and the function. There is no longer a single place to look up everything related to scheduling, and this will only get messier as new data processors are introduced.

Another point discussed earlier is keeping functions focused on one thing (thin functions). We were trying to avoid doing anything other than data ingestion in the function; with this solution, the function also takes care of scheduling finished-event processing.

On the other hand, this feels like a more self-contained solution. Data ingestion from this source (the 3rd-party API) resides in this part of the system. It knows how to process data from the source, and processing finished events for 7 days after they finish is part of the data-processing requirements for this specific source.

Final words

Which solution to use? I think the Logic App solution (adding some logic to the Logic App to trigger data processing) is superior. Scheduling processing in the future is a very powerful tool and has its uses, but it doesn't fit quite so well here.
