Hi Alumio,
I often run into issues comparable to the ones discussed in these topics from the past few weeks:
Basically, you want to ‘batch’ entities (or rows), but the input (it doesn’t really matter what it is, it might be JSON or XML) is too large to process at once. You could do it per row, but that means you have to build your own batching mechanism (which is what I do now, using storages and a lot of logic).
I had a case a while back where I needed to run (often, like once or twice per hour) through a CSV with ‘price rules’ of around 250.000 lines. It seems that the “consumer received an entity from subscriber” action adds around 2-3ms of overhead per entity (I can’t really measure that, so it’s just a guess). With 150.000 lines that means 150.000 entities, which adds up to 7-8 minutes of overhead alone.
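As a quick back-of-the-envelope check of that guess (the ~3 ms figure is my own assumption, not something measured inside Alumio):

```python
# Back-of-envelope: an assumed ~3 ms of per-entity overhead (my guess, not a
# measured Alumio figure) multiplied by the number of rows-as-entities.
entities = 150_000
overhead_per_entity_s = 0.003

print(f"{entities * overhead_per_entity_s / 60:.1f} minutes")  # ~7.5 minutes of pure overhead
```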
Then I found that the CSV decoder has a built-in option to ‘batch’ lines (“Items in group” / “The amount of items that are bundled in single entity. 0 means the entire file in one entity.”), up to a maximum of 1440. The time went from 7 minutes to 6 seconds! That’s because the number of entities it had to work through dropped from 150.000 to 104. The actual gains are even bigger: because I was able to remove my own ‘batching’ mechanism (which meant writing to a storage entity for every entity it processed), the total “time to task” went from roughly 15 minutes to 7 seconds.
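Conceptually, I picture that grouping roughly like this minimal Python sketch (not Alumio’s actual implementation; the file name is made up):

```python
import csv
from itertools import islice
from typing import Iterator

def read_csv_in_groups(path: str, group_size: int) -> Iterator[list[dict]]:
    """Yield lists of up to `group_size` rows, so one entity carries many rows."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            group = list(islice(reader, group_size))
            if not group:
                break
            yield group

# With ~150.000 price-rule rows and group_size=1440 this yields roughly a
# hundred entities instead of 150.000 single-row ones.
for entity in read_csv_in_groups("price_rules.csv", 1440):  # hypothetical file
    pass  # each `entity` is a list of up to 1440 rows
```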
What I would like to see is that every Response Decoder gets a built-in way to ‘group’/‘batch’ lines, like the CSV decoder already has. So let’s say you use the JSON Response Decoder with the Read Method ‘incremental’ and the path customers[], and you want to work with batches of 1.000: you could just set “Items in group” to 1.000. The first entity you’d get would be an array of the first 1.000 items, and each following entity would contain the next 1.000.
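The nice part is that this grouping is completely generic, so the same idea works for any incrementally decoded stream. A minimal sketch of what “Items in group” could mean for an arbitrary decoder output (plain Python, not Alumio internals):

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def in_groups(items: Iterable[T], group_size: int) -> Iterator[list[T]]:
    """Group any stream of decoded items into entities of up to `group_size`.
    A group_size of 0 mirrors the CSV decoder: the entire input as one entity."""
    iterator = iter(items)
    if group_size <= 0:
        yield list(iterator)
        return
    while True:
        group = list(islice(iterator, group_size))
        if not group:
            break
        yield group

# Example: ten decoded items, grouped per 4.
print(list(in_groups(range(10), 4)))
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```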
This would be a very useful addition to our toolkit. When working with large datasets, the main thing you need to watch is memory, and for memory usage it is best to work with as small a chunk as possible. For speed, however, the opposite is often true: when you have to run through 100.000 lines, it is usually fastest to run through them all at once.
Right now, we often only have the option to do it per entity (or row) or as one whole. We need something in between.