Task Deduplication

Hi Alumio,

I’m getting more and more use cases where the body of a task is really just a placeholder saying “this particular entity has to be updated”. This generally happens when the actual content of a task is too big (so you have to place the update inside a storage or something similar) or when the ‘update trigger’ (i.e. a particular system saying it has been updated) doesn’t actually give you the entirety of the entity you’re working with.

For those use cases, a ‘task deduplication’ option on the route would be great. Basically, before creating a task, it would check within that route whether there’s already a new task with that exact entity identifier and, if so, skip creating it.
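To make the request concrete, here is a minimal sketch of the check in Python. It is not Alumio's API: the `Route` class, `create_task`, `finish_next_task`, and the `"new"`/`"finished"` statuses are all hypothetical names used only to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Route:
    """Hypothetical stand-in for a route and its task queue (not Alumio's API)."""
    name: str
    # Identifiers of tasks that are still waiting to be processed ("new").
    pending_ids: set = field(default_factory=set)
    tasks: list = field(default_factory=list)

    def create_task(self, entity_id: str, deduplicate: bool = True) -> bool:
        """Create a placeholder task; return False if it was skipped as a duplicate."""
        if deduplicate and entity_id in self.pending_ids:
            return False  # a new task for this entity already exists in this route
        self.pending_ids.add(entity_id)
        self.tasks.append({"entity_id": entity_id, "status": "new"})
        return True

    def finish_next_task(self) -> Optional[dict]:
        """Mark the oldest new task as finished so the entity can be queued again."""
        for task in self.tasks:
            if task["status"] == "new":
                task["status"] = "finished"
                self.pending_ids.discard(task["entity_id"])
                return task
        return None
```

The point of the design is that the duplicate is rejected before any task exists, and that an entity becomes eligible again as soon as its pending task has been processed.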

Hi Floris,

Thank you for your feedback.

It seems that kind of functionality is already covered by “Filter previously stored entities” or the entity filter “Filter by storage entities”, right? Or do you need a handier option within the route to do so?

Hi Gugi,

Let’s say you have a “customer” entity. This customer entity needs to be sent fully to the receiving system, however, you get the actual data to build it from a few different sources:

  • An ‘Addresses’ endpoint
  • A ‘Contacts’ endpoint
  • A ‘Customer Group’ endpoint
  • A ‘Company’ endpoint

Each has its own updated_at timestamp, and adding, for example, a Contact to the Company will not ‘update’ the Company itself.

So to make this work, you can poll each of these endpoints to check for changes and, if you find a change, get the Company Number and save it to a storage (= your queue). Then add an incoming which reads that storage and creates tasks for a full “update customer/company entity” route.

The “Filter previously stored entities” option will not really work in this case, because you don’t actually have the full entity before running the route.
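The polling setup described above can be sketched roughly as follows. The endpoint names mirror the list in this post, but the sample rows, the `SAMPLE_CHANGES` structure, and the `collect_changed_companies` helper are made up for illustration; a real setup would fetch the changes from the source system.

```python
from datetime import datetime, timezone

# Hypothetical change feeds: each endpoint reports (company_number, updated_at).
SAMPLE_CHANGES = {
    "addresses": [("C-1001", datetime(2024, 1, 2, tzinfo=timezone.utc))],
    "contacts": [
        ("C-1001", datetime(2024, 1, 3, tzinfo=timezone.utc)),
        ("C-1002", datetime(2023, 12, 30, tzinfo=timezone.utc)),
    ],
    "customer_groups": [],
    "companies": [("C-1003", datetime(2024, 1, 5, tzinfo=timezone.utc))],
}


def collect_changed_companies(changes_by_endpoint, last_poll):
    """Return {company_number: [endpoints that changed]} for changes after last_poll."""
    queue = {}  # a dict holds each company number once and keeps insertion order
    for endpoint, rows in changes_by_endpoint.items():
        for company_number, updated_at in rows:
            if updated_at > last_poll:
                queue.setdefault(company_number, []).append(endpoint)
    return queue


last_poll = datetime(2024, 1, 1, tzinfo=timezone.utc)
queue = collect_changed_companies(SAMPLE_CHANGES, last_poll)
# queue == {"C-1001": ["addresses", "contacts"], "C-1003": ["companies"]}
```

Note that the queue only ever contains the Company Number, never the full customer entity. The entity itself is assembled later, inside the task.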

Hi Floris,

Thank you for the thorough explanation.

Our existing entity filter “Filter by storage entities” can filter out entities or skip tasks whose identifier already exists in a storage. Please find an example below.

Let’s say we already have a storage filled with entities, such as the one below.

Each storage entity consists of id and name properties.

image

You can use “Filter by storage entities” with the condition “An item with the identifier must exist in the storage” to filter out new entities coming in.

Even though the incoming entity only has the identifier, that is enough for Alumio to filter out the entity, because it has the same identifier as the one stored in the storage.
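As a rough illustration of identifier-only matching (not Alumio's actual implementation), the comparison can be sketched in Python. The `split_by_storage` helper and the sample entities are hypothetical; which side of the split is kept depends on the condition configured in the filter.

```python
def split_by_storage(incoming, storage, id_key="id"):
    """Split incoming entities into (already stored, not stored) by identifier only."""
    stored_ids = {entity[id_key] for entity in storage}
    known = [e for e in incoming if e[id_key] in stored_ids]
    unknown = [e for e in incoming if e[id_key] not in stored_ids]
    return known, unknown


# Hypothetical storage entities with id and name properties.
storage = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
# Incoming entities carry only the identifier.
incoming = [{"id": 2}, {"id": 3}]

known, unknown = split_by_storage(incoming, storage)
# known == [{"id": 2}], unknown == [{"id": 3}]
```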

It’s not as handy as a “Task deduplication” option on the route, but it might help you achieve your objective with the current functionality.

Feel free to let me know if this is not the case.

Hi @Gugi ,

Wherever possible, I do use comparable techniques.

The thing is that a task is computationally expensive for Alumio, so being able to prevent a task from being created at all would be preferable.

@Gugi I do like this feature.
With it, you don’t need to set up any storages.

It would be even more useful with overlapping batch syncs (XML and the like).

Hi @floris You can put the entity filter in the incoming configuration or the route so that no task will be created.

@anon99371720 Do you mean that the identifier would be compared to the existing tasks, or would we need an “invisible storage” to log all the processed identifiers?

Hi @Gugi ,

In many different routes, you don’t actually have the entire entity in the incoming. The entity is basically constructed inside of a task. So that wouldn’t work.

Hi @floris,

The entity filter only needs the identifier; it doesn’t need the entire entity to compare, as I mentioned before.

Hi @Gugi ,

But that’s not an option for the process as I described it.

You could, I suppose, create a ‘save to storage’ step where you only store the entity identifier. That storage would then hold all “active tasks”, and within the Outgoing you’d delete that entity from the storage. But then what happens when the Outgoing fails? The entity would never be re-synced! (Note that you could somewhat fix this by utilizing TTLs etc.)
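A minimal sketch of that workaround, including the TTL patch, might look like this in Python. The `ActiveTaskStorage` class and its `try_claim` and `release` methods are hypothetical and only model the behaviour; they are not an Alumio storage API.

```python
import time


class ActiveTaskStorage:
    """Hypothetical storage of 'active task' identifiers with a TTL per entry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the behaviour can be tested without waiting
        self._expires_at = {}

    def try_claim(self, entity_id):
        """Return True if a task may be created, False if one is still active."""
        now = self.clock()
        expires_at = self._expires_at.get(entity_id)
        if expires_at is not None and expires_at > now:
            return False  # identifier is still marked as an active task
        self._expires_at[entity_id] = now + self.ttl_seconds
        return True

    def release(self, entity_id):
        """Called from the Outgoing after a successful sync."""
        self._expires_at.pop(entity_id, None)
```

If the Outgoing fails and `release` is never called, every update for that entity is skipped until the TTL runs out, so the TTL only shortens the gap rather than closing it.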

So no, that’s not an option for the situation I’m mentioning.