How to calculate the number of distinct rows from a table or a file within a data flow using SSIS?
Tutorial to learn how to count distinct rows in a SQL Server table or data files with SSIS and store it in a text file. To achieve this, we use the SSIS Aggregate transformation. This operation is the SSIS equivalent of executing a COUNT DISTINCT command in SQL Server or in standard SQL. Indeed, the COUNT DISTINCT aggregation option is used to calculate the unique rows in a table or file, within a data flow.
1. Create an SSIS package to count distinct rows
One prerequisite is to have the Adventureworks sample database installed and configured in one database. To begin, in the Data Flow, place a source, the SSIS Aggregate component, and a destination. For this example, the source will be the SSIS OLE DB Source component and the Flat File Destination will serve as the destination for exporting the result to a text file. In order to calculate the number of unique rows with SSIS, add these 3 components to the data flow:
- The OLE DB Source
- The SSIS Aggregate transformation
- And the Flat File Destination
2. Point the OLE DB source component towards the SQL table
Next, it is necessary to configure the SSIS Data Flow. Visual Studio will now display error messages on the Integration Services components. Afterward, double-click on the OLE DB Source component and select a table. For this example, we are selecting the HumanResources.Employee table from the AdventureWorks database.
3. Configure the SSIS Aggregate Component with Count Distinct
Once the source has been selected and configured, it is time to configure the SSIS Aggregate component. In the Aggregations tab, select the columns where the COUNT DISTINCT should be performed, that is, the columns to aggregate. In the Operation column, choose COUNT DISTINCT.
4. Link the Aggregate transformation to the target file
Now, connect the SSIS Aggregate component to the final component, known as the Flat File Destination. Double-click on this component to select the destination file. Finally, in the Mapping tab, ensure that the columns are correctly selected.
After configuring all the SSIS package components, there will no longer be any errors in the Data Flow.
5. Execute the package to export the distinct row count
Next, execute the SSIS package. The exported flat file will contain the distinct row count for the selected columns. The target file will now consist of a single row and the number 290, which corresponds to the number of rows in the input table.
Conclusion on the SSIS Aggregation Component
This article demonstrates step by step how to perform a count distinct with SSIS, without writing any line of SQL code. For performance optimization, it is recommended to execute SQL operations directly on the SQL Server engine, thus reducing the data transfer between the SQL Server database and the SSIS server. Integration Services packages enable all operations available in T-SQL through a graphical interface and a simplified industrialisation process. For example, here is how to pivot the rows of a SQL table into columns with SSIS.
Of course, it is also possible to perform the same pivot operation, which is also very handy but a bit more complex in terms of coding, using the SQL Server PIVOT query directly from SSMS and only with T-SQL code.