QuerySurge Technical Whitepaper No. 7
A common need for some QuerySurge projects is to check tables for duplicate rows. As you're likely aware, QuerySurge by design does not have "built-in" tests – we want to put the full power of your SQL dialects in your hands to craft the tests that you need, so we avoid any pre-configured tests. You are probably also aware that the minimal executable unit in QuerySurge is a QueryPair – a pair of logically related queries. In other words, QueryPairs are "binary" - because each is composed of two queries. So, how does a user implement a check for duplicate rows on a single table in a QueryPair? The answer is to set up a "one-sided" or "unary" test.
Writing a Duplicate-row Checking Query
The first thing we need to consider is the syntax for a duplicate-checking strategy. This can be done using the GROUP BY syntax that most database products offer, along with a HAVING clause that filters out column values that only appear once. Consider the following USERS table, in which one entry appears twice (ID = 8, 30, in red) and a second entry appears three times (ID = 2, 6, 21, in blue). All other entries appear only once.
If we want to check a single column for duplicates (the NAME column in this example), we can write:
SELECT u.Name, COUNT(u.Name)
FROM SALES.USERS u
GROUP BY u.Name having COUNT(u.Name) > 1;
Note: These examples are worked in Oracle 10g Express. The equivalent syntax for your database may differ.
The results of this query are:
Note that this solution just checks a single column for duplicates. We could extend the query to include all data columns explicitly in the GROUP BY, and use COUNT(*) to check for row counts:
SELECT u.Name, u.Nickname, COUNT(*)
FROM SALES.USERS u
GROUP BY u.Name, u.Nickname HAVING COUNT(*) > 1;
Note: We've ignored the ID column, which we're using only to call out specific rows in our discussion.
In this case, we get a similar result, except that we now can see that both columns are included:
Note that GROUP BY works at the row level on all columns listed, so that a true row-level evaluation is achieved. In other words, this syntax will find true duplicate rows (for those columns listed in the GROUP BY).
Implementing a Duplicate-Checking Test in a QueryPair
Now that we know how to check for duplicate rows, how can we formulate this as a unary test in QuerySurge? The answer is relatively straightforward: we can modify our duplicate-checking query to return a numeric value, depending on whether the resultset contains duplicates. This query can go on the Source side of a QueryPair. On the Target side, we just use a static query to return the expected result.
In order to modify our duplicate-checking query for QuerySurge, we’ll use the CASE syntax, and we SUM the duplicate row counts. If the GROUP BY row count is zero, there are no duplicates, and the query returns a ‘0’. If any row count is nonzero (as in our example above), then duplicate rows have been found, and the query returns a ‘1’.
So, we can write for the QueryPair Source query:
WHEN SUM(COUNT(*)) > 0 THEN 1
END AS HASDUPES
FROM SALES.USERS u
GROUP BY u.Name, u.Nickname HAVING COUNT(*) > 1
Note: There are several different approaches that one could use for writing a duplicate-checking query. You are free to use any approach that works for your problem.
Assuming we want our test to fail if there are duplicates found, we can write on the Target:
SELECT 0 AS HASDUPES FROM DUAL
Note: We're using the ‘FROM DUAL’ syntax because this example uses Oracle syntax; your database may require a different syntax.
As noted, if duplicates exist, the Source query will return ‘1’ and the Target will return ‘0’, as shown in Figure 1.
Figure 1. The duplicate checking test in QuerySurge. If a resultset has duplicates, the Source query returns ‘1’; if no duplicates are found, the Source query returns ‘0’. The Target query always returns ‘0’.
Note: The ‘Row Duplicate Options’ Feature
You may have noticed on the Properties tab for each QueryPair that there is a Row Duplicate Options property. This is set to ‘Do Not Check For Duplicates’ by default. Note that this feature, while similarly named, does not do what we have described in this whitepaper. This feature enables QuerySurge to handle resultsets that are already known to you to contain duplicate rows. Changing this to ‘Check For Duplicates’ enables a comparison algorithm for the QueryPair that uses a statistical approach to comparing Source and Target resultsets. It will not let you know whether it found duplicates; it will only let you know the results of the comparison. So – this feature is useful if you already know a resultset (either Source or Target or both) contains duplicate rows. It is not useful for testing whether a resultset contains duplicate rows – that is illustrated in the discussion above.
Note also that the ‘Row Duplicate Options’ feature is resource-intensive. Large resultsets will consume more resources and computing time, so use this option with some forethought, since you could run out of resources on a QueryPair of a size where you normally have sufficient resources.