
SharePoint Blog

August 03
Big Data and Nintex: Batch Processing Large Datasets in Parallel Using a Nintex Site Workflow

Scenario:

On a recent client project, we were unable to write any server-side code for advanced data processing, which gave us a perfect opportunity to use Nintex, this time in the form of an ETL-style bulk data processing solution.

The Solution:

Deciding on an Information Architecture for Large Datasets

To be fair, neither Nintex nor SharePoint 2013 was built to routinely process bulk amounts of list data.  SharePoint 2013 enforces a List View Threshold that, by default, limits a query to 5,000 items for non-admin users and 20,000 items for administrators and auditors. Nintex’s list-based actions (e.g., Query List, Update Multiple Items, and Delete Multiple Items) are subject to this limit. Providing query parameters that return more items than allowed will cause these actions to fail, throwing your workflow into an “Error Occurred” state.

To work around the threshold issue when processing large datasets, either of the following two approaches for changing the information architecture would be valid:

1) Change the quantity of lists (go wide, not deep): 

a. Make more lists to hold all the data.  This option will likely mean coming up with a naming convention and/or a process for creating new lists that are associated with one another.

Example: 

“My Large Dataset List_1”, “My Large Dataset List_2” …“My Large Dataset List_N”

2) Change the list schema (be more descriptive):

a. Supplement the fields in your data schema with a Batch ID value that can be used to identify discrete blocks of data in the large list.  (See example below)

[Image: melvin83151.PNG]

[Image: melvin83152.PNG]


Note: Another less obvious advantage of breaking your data up into smaller batches and processing it with Nintex workflows is that bad data or workflow logic errors can more easily be traced to the subset of data that caused the issue. 

Using the second approach, we can populate the Batch ID column by keeping a running count of the items we tag with a batch number and fractionally incrementing that number (e.g., 1.0, 1.1, 1.2) each time we reach the predefined batch size (generally <= 1000 items).  Once every item is tagged with a Batch ID, specifying that Batch ID along with more detailed query parameters keeps each query under the threshold and ensures faithful execution of the workflow.
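As a rough illustration of that tagging logic (not something Nintex itself requires; the item source, the "BatchID" field name, and the 1,000-item batch size are assumptions for this sketch), the fractional increment could be expressed in Python as follows:

BATCH_SIZE = 1000  # keep each batch comfortably under the list view threshold

def assign_batch_ids(items, start=1.0, step=0.1, batch_size=BATCH_SIZE):
    """Tag each item dict with a 'BatchID', incrementing the ID fractionally
    every `batch_size` items (1.0, 1.1, 1.2, ...)."""
    batch_id = start
    for index, item in enumerate(items):
        if index > 0 and index % batch_size == 0:
            batch_id = round(batch_id + step, 1)
        item["BatchID"] = batch_id
    return items

if __name__ == "__main__":
    sample = [{"Title": f"Item {i}"} for i in range(4000)]
    tagged = assign_batch_ids(sample)
    # 4,000 items in batches of 1,000 yields Batch IDs 1.0 through 1.3,
    # matching the pattern shown in Figure 1 below.
    print(sorted({item["BatchID"] for item in tagged}))  # [1.0, 1.1, 1.2, 1.3]

In the actual solution the column would, of course, be populated inside SharePoint (for example by a one-time tagging workflow or during the initial data import) rather than by an external script.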

Running Nintex workflows in parallel

After segmenting the data into batches, it can be tempting to simply schedule a series of workflows to work through the inventory of batches sequentially.  While that is a nice “set it and forget it” way to process information, there is a more hands-on way to work through a large dataset with a little manual intervention.

Let’s assume that we have a list of 40,000 items with a minimal schema (as in our previous example).  Assuming the recommended configuration for a single-server SharePoint 2013 farm and no other major processes running on the server, moving through each batch in the list sequentially at 500 items per minute would take approximately 80 minutes to complete.  Alternatively, processing 2,500 items every 2 minutes (five parallel workflows at 500 items per minute, plus an additional minute of padding) would take only 32 minutes, a savings of 48 minutes and roughly a 2.5x efficiency increase.  To perform this test, follow the steps below to repeatedly access the workflow start page for your Nintex workflow, specifying a new Batch ID each time.  Never execute more than 5 workflows simultaneously (I have personally maxed out at 10 processes with this approach); simply wait for the first group of 5 batches to complete before repeating the process for the remaining batches.
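If you want to sanity-check those numbers, here is a quick back-of-the-envelope calculation in Python using only the figures above (the 500-items-per-minute rate, the 2,500-item rounds, and the one-minute padding are the same assumptions stated in the text):

import math

TOTAL_ITEMS = 40_000
ITEMS_PER_MINUTE = 500       # assumed single-workflow processing rate
ITEMS_PER_ROUND = 2_500      # 5 parallel workflows x 500 items per minute
ROUND_MINUTES = 2            # one minute of processing plus one minute of padding

sequential_minutes = TOTAL_ITEMS / ITEMS_PER_MINUTE                          # 80
parallel_minutes = math.ceil(TOTAL_ITEMS / ITEMS_PER_ROUND) * ROUND_MINUTES  # 32

print(f"Sequential: {sequential_minutes:.0f} min")
print(f"Parallel:   {parallel_minutes} min")
print(f"Saved:      {sequential_minutes - parallel_minutes:.0f} min "
      f"({sequential_minutes / parallel_minutes:.1f}x faster)")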

Steps: Run the same workflow multiple times (in parallel)

1. Ensure your workflow has a start parameter configured to accept the Batch ID when the workflow starts.

[Image: melvin83153.PNG]

2. Navigate to the Site Workflows area of your site and click on the workflow.  You should now be at the start page with the “Batch To Process” parameter waiting for your input.

[Image: melvin83155.PNG]

3. Copy the URL.

4. Open four more tabs with the same URL you just copied.

5. Specify subsequent batch numbers in each tab (I’ve pulled a few tabs out into several windows so that you can see the pattern below)

[Image: melvin83156.PNG]

Figure 1 - Processing four batches of data (from top to bottom, left to right: batches 1.0, 1.1, 1.2, and 1.3)

6. Hit the Start button in each window to start the workflow processing for that batch.

7. Navigate back to the Site Workflows page and monitor the progress of your running workflows.  You will need to refresh the page every minute or so to check on progress.

[Image: melvin83157.PNG]

8. Rinse and repeat Steps 5 - 7 for however many batches you need to process for the large dataset.  You will see that the workflows started together run in parallel and complete at roughly the same time.
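For readers who prefer to see the “groups of five” rhythm expressed in code, here is a purely conceptual Python sketch.  The process_batch function is a hypothetical stand-in for opening the start page, entering a Batch ID, and clicking Start; it is not a Nintex API call, and the Batch ID values are only an example:

from concurrent.futures import ThreadPoolExecutor

BATCH_IDS = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]  # example inventory of batches
MAX_PARALLEL = 5  # never run more than 5 workflows at once


def process_batch(batch_id):
    # In practice this step is manual: open the workflow start page in a
    # browser tab, enter the Batch ID, and click Start.
    print(f"Processing batch {batch_id} ...")
    return batch_id


# Work through the batches in groups of five, waiting for each group to
# finish before starting the next, just as described in Steps 5 - 7.
for group_start in range(0, len(BATCH_IDS), MAX_PARALLEL):
    group = BATCH_IDS[group_start:group_start + MAX_PARALLEL]
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        list(pool.map(process_batch, group))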

