- Data that is complete, accurate and relevant
- A set of processes to achieve such a state
Data quality problem scenarios
There are often technical problems with source data, in the form of outliers which can’t be handled in the desired way. The primary responsibility is first to detect the “bad” data, and then to reject it (so it doesn’t break the ELT logic), and finally audit the fact that some data had to be rejected. That way:- Business decisions are made against a known baseline
- There’s a chance for any invalid data to get fixed at source
Missing values
Applies mainly to attributes rather than whole records. For example, it’s easy to detect if an airline ticket record is missing a “flight number” attribute. However it’s usually impossible to determine that someone should have bought an airline ticket, and that for some reason no record of them buying it has appeared.Handling duplicates
Duplicates are usually easy to detect, as long as it’s clear what the unique identifier should be. Sometimes duplicates are genuinely errors, but sometimes they may indicate a missing business rule. For example, you receive two ticket records for the same person on the same flight, but with different seat numbers. The missing business rule might be that their seat allocation got changed, and only the second record is still valid.”Dates” that are not dates
This applies to any data which looks like something else, and which maybe a human could read, but which is not meaningful to a computer. For example, a string column containing the value2017-01-29.
This is an array of bytes [32,30,31,37,2d,30,31,2d,32,39] which can be successfully converted into a date. Whereas the string tomorrow, which makes perfect sense to a human, would be read as the byte array [74,6f,6d,6f,72,72,6f,77], and which is not a legitimate date.
Missing values
In a Transformation pipeline, detection of missing values is most commonly achieved using a Filter component. The following pipeline shows a flight Table Input component with several filters. Records which pass all the filters along the “happy path” at the top have both an Origin and a Departure airport, and are therefore usable for further processing.
In contrast, any records which fail either filter will end up going to the Unite operator for accumulation. The Calculator component adds a couple of attributes recording the time and reason for failure.
All records which pass the filters to end up in the “problem path” should be saved to an audit table using the designer’s Table Output component.
From there they can easily be sent back to source for data correction, or else reprocessed later once the missing business rules have been added.
Complementary filters
This pipeline also illustrates the important and commonly used technique of adding complementary filters. The conditions in the pairs of filters highlighted below are exactly opposite to each other, and every record will either fall into one or the other category.
It’s important that these pairs of filters really are exactly opposite. There should be no cases where a record is successfully processed and gets audited to the rejection log.
Handling duplicates
It’s vital for relational data to have known unique or primary keys. Primary keys are used only by the optimizer and are not enforced. Detecting duplicate data and acting accordingly is most commonly done using an Analytic (or “Window”) function in a Calculator component, with a formula like this:| Flight Number | Seat Number | Timestamp | ROW_NUMBER() OVER (PARTITION BY flightnum, seatnum ORDER BY timestamp) | Comment |
|---|---|---|---|---|
| BA123 | 43F | 2009-01-11 14:52:01 | 1 | First allocation |
| BA123 | 67B | 2009-01-11 14:58:18 | 2 | Changed by customer |
| BA123 | 52A | 2009-01-12 10:02:29 | 3 | Final allocation |
ORDER BY <<sort columns>> DESCENDING)
The complementary filters would be implemented by checking for the value equal to 1 (not a duplicate) and anything greater than 1 (is a duplicate), like this:
“Dates” that are not dates
You should rightly expect that certain datatype conversions should be performed automatically, because they are unambiguous. For example it usually causes no problems to implicitly convert the strings “1”, “2” and “3” into integers, and append them into a table already containing integers. The database should successfully do this on your behalf without any manual intervention needed in the pushdown SQL. Dates, however, are a different matter, and can cause data quality problems. For example a CSV file might contain the string “2009-01-11” indicating a seat reservation date. If taken from an American system you should assume that it means November 1st. But on an Australian or British system you should read it as January 11th. You might be lucky and have the database correctly guess which interpretation you need. But when converting strings like these into real date data types, it’s best to be explicit about the format mask, using a TO DATE function such as: This also gives you the opportunity to apply business rules (such as “tomorrow” meaning system date + 1 day), and to reject dates that really do contain nonsense values. For example, in a Calculator component, you could do as follows:- the_date IS null or blank
- the_date IS NOT null or blank
Applying business rules
The above examples have shown how to apply simple business rules to incoming data, to interpret it correctly.- If you receive more than one seat allocation record for a flight, the most-recently updated record is the one to use
- The date value “tomorrow” means “today’s date + 1 day”
- Leave nonsense dates blank in the output
- Flight distance is sometimes missing
- Sometimes there is a small disagreement in distance between the outbound and inbound legs of the journey between the same two airports
- In some cases the recorded distances have changed over time (maybe due to differences in how it was calculated)
- If “distance” is present then just use it in the flight “fact” table
- If “distance” is missing, then fill in from the most-commonly found distance value for other flights between the same origin and destination, in either direction.
- In the “route” dimension, use the most-commonly found distance value for all flights between that origin and destination
The above transformation pipeline creates the intermediate table in several steps:
- Only use records where source, destination and distance are all present (Filter known)
- Aggregate and count occurrences (Agg)
- To obtain a consistent direction, multiplex through almost-complimentary filters (
origin < destinationandorigin > destination), and swap the column names along one branch. This aligns all data between pairs of airports, regardless of flight direction. It also takes the chance to remove any spurious data where the departure and arrival airports are the same. - In “Sum Frequency”, add all the pairs of airports (so 100 instances of LAX → JFK plus 101 JFK → LAX would become 201 instances of JFK → LAX)
- Use the ROW_NUMBER analytic function to find the most commonly recorded distance between every pair of airports
- (Re-)Create the common_flight_distances table containing the results
The common_flight_distances is replicated, and left-joined twice to the fact table: once on origin=origin and destination=destination, and then once on origin=destination and destination=origin. This caters for the fact that in the summary table the origin is always alphabetically lower than the destination, whereas in reality flights can go either direction.
From the second join, three distance measurements are available:
- The original distance
- The most commonly measured distance (if the origin happens to be alphabetically lower than the destination)
- The most commonly measured distance (if the origin happens to be alphabetically higher than the destination)

