Data warehouse testing typically involves a few steps:
Analyze source documentation
As with many other projects, when testing a data warehouse implementation there is typically a requirements document of some sort. These documents can be useful for developing a basic test strategy, but they often lack the detail needed to support test development and execution. Often there are other documents, known as source-to-target mappings, that provide much of the detailed technical specification. These source-to-target documents specify where the data comes from, what should be done to it, and where it should be loaded. If available, additional system-design documentation can also help guide the test strategy.
Develop strategy and test plans
As you analyze the various pieces of source documentation, you'll want to start developing your test strategy. I've found that from a lifecycle and quality perspective, it's often best to take an incremental testing approach when testing a data warehouse. This essentially means that the development teams deliver small pieces of functionality to the test team earlier in the process. The primary benefit of this approach is that it avoids an overwhelming "big bang" delivery and enables early defect detection and simplified debugging. In addition, it establishes the detailed processes involved in the development and testing cycles. Specific to data warehouse testing, this means testing acquisition staging tables, then incremental tables, then base historical tables, then BI views, and so forth.
Another key data warehouse test strategy decision is the analysis-based test approach versus the query-based test approach. The pure analysis-based approach puts test analysts in the position of mentally calculating the expected result by analyzing the target data and related specifications. The query-based approach involves the same basic analysis but goes further to codify the expected result in the form of a SQL query, which offers the benefit of setting up a future regression process with minimal effort. If the testing effort is a one-time effort, it may be sufficient to take the analysis-based path, since that is typically faster. Conversely, if the organization will have an ongoing need for regression testing, a query-based approach may be more appropriate.
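As a minimal sketch of the query-based approach, the expected result can be codified as a SQL query and rerun on every load. The table and column names below (src_orders, dw_order_fact, order_total) are illustrative placeholders, not from any particular project:

    -- Query-based test: an empty result set means the load matched expectations.
    SELECT s.order_id
    FROM   src_orders s
    LEFT JOIN dw_order_fact t
           ON t.order_id = s.order_id
    WHERE  t.order_id IS NULL                                      -- record was dropped
       OR  t.order_total <> ROUND(s.unit_price * s.quantity, 2);   -- bad transformation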
Test development and execution
Depending on the stability of the upstream requirements and analysis process, it may or may not make sense to do test development in advance of test execution. If the situation is highly dynamic, any tests developed early may largely become obsolete. In that case, an integrated test development and execution process that occurs in real time can usually yield better results. In any case, it is helpful to frame the test development and execution process with guiding test categories. For example, a few data warehouse test categories might be:
- Record counts (expected vs. actual; see the sketch after this list)
- Duplicate checks
- Reference data validity
- Referential integrity
- Error and exception logic
- Incremental and historical processing
- Control column values and default values
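As a rough illustration of the first two categories, record counts and duplicate checks can each be codified in a short query. The staging, target and reject table names (stg_customer, dw_customer, dw_customer_reject) and the natural key column (customer_nk) are placeholders:

    -- Record counts: rows loaded plus rows rejected should equal the source count.
    SELECT (SELECT COUNT(*) FROM stg_customer)        AS source_count,
           (SELECT COUNT(*) FROM dw_customer)         AS loaded_count,
           (SELECT COUNT(*) FROM dw_customer_reject)  AS rejected_count;

    -- Duplicate check: the natural key should be unique in the target table.
    SELECT customer_nk, COUNT(*) AS occurrences
    FROM   dw_customer
    GROUP  BY customer_nk
    HAVING COUNT(*) > 1;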
Strategies for Testing Data Warehouse Applications
Businesses are increasingly focusing on the
collection and organization of data for strategic decision-making. The ability
to review historical trends and monitor near real-time operational data has
become a key competitive advantage.
This article provides practical recommendations for testing extract, transform and load (ETL) applications, based on years of experience testing data warehouses in the financial services and consumer retailing industries. Every attempt has been made to keep this article tool-agnostic, so that it applies to any organization attempting to build a new data warehouse or improve an existing one.
There is an exponentially increasing cost
associated with finding software defects later in the development lifecycle. In
data warehousing, this is compounded because of the additional business costs
of using incorrect data to make critical business decisions. Given the
importance of early detection of software defects, let's first review some
general goals of testing an ETL application:
- Data completeness:
Ensures that all expected data is loaded into target tables.
- Data transformation:
Ensures that all data is transformed correctly according to the business rules and/or design specifications captured in the ETL mapping document.
- Data quality:
Ensures that the ETL application correctly rejects, substitutes default values, corrects or ignores, and reports invalid data. Failed records should be reported back to the business team.
- Performance and scalability:
Ensures that data loads and queries perform within expected time frames
and that the technical architecture is scalable.
- Integration testing: Ensures that the ETL process functions well
with other upstream and downstream processes.
- User-acceptance testing:
Ensures the solution meets users' current expectations and anticipates
their future expectations.
- Regression testing:
Ensures existing functionality remains intact each time a new release of
code is completed.
Data Completeness
One of the most basic tests of data completeness
is to verify that all expected data loads into the data warehouse. This
includes validating that all records, all fields and the full contents of each
field are loaded. Strategies to consider include:
- Comparing record counts between source data,
data loaded to the warehouse and rejected records.
- Comparing unique values of key fields between source data and data loaded to the warehouse. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields (a query sketch follows this list).
- Utilizing a data profiling tool that shows
the range and value distributions of fields in a data set. This can be used
during testing and in production to compare source and target data sets
and point out any data anomalies from source systems that may be missed
even when the data movement is correct.
- Populating the full contents of each field
to validate that no truncation occurs at any step in the process. For
example, if the source data field is a string (30) make sure to test it
with 30 characters.
- Testing the boundaries of each field to find
any database limitations. For example, for a decimal (3) field include values
of -99 and 999, and for date fields include the entire range of dates
expected. Depending on the type of database and how it is indexed, it is
possible that the range of values the database accepts is too small.
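One way to implement the unique-key comparison above is a set-difference query run in each direction. The table and column names (src_account, dw_account, account_id) are placeholders; Oracle uses MINUS where other databases use EXCEPT:

    -- Keys present in the source but missing from the warehouse:
    SELECT account_id FROM src_account
    EXCEPT
    SELECT account_id FROM dw_account;

    -- Keys present in the warehouse but absent from the source:
    SELECT account_id FROM dw_account
    EXCEPT
    SELECT account_id FROM src_account;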
Data Transformation
Validating that data is transformed correctly
based on business rules can be the most complex part of testing an ETL
application with significant transformation logic. One typical method is to
pick some sample records and "stare and compare" to validate data transformations
manually. This can be useful but requires manual testing steps and testers who
understand the ETL logic. A combination of automated data profiling and
automated data movement validations is a better long-term strategy. Here are
some simple automated data movement techniques:
- Create a spreadsheet of scenarios of input
data and expected results and validate these with the business customer.
This is a good requirements elicitation exercise during design and can
also be used during testing.
- Create test data that includes all
scenarios. Elicit the help of an ETL developer to automate the process of
populating data sets with the scenario spreadsheet to allow for
flexibility because scenarios will change.
- Utilize data profiling results to compare
range and distribution of values in each field between source and target
data.
- Validate correct processing of ETL-generated
fields such as surrogate keys.
- Validate that data types in the warehouse
are as specified in the design and/or the data model.
- Set up data scenarios that test referential integrity between tables. For example, what happens when the data contains foreign key values not in the parent table? (A query sketch follows this list.)
- Validate parent-to-child relationships in
the data. Set up data scenarios that test how orphaned child records are
handled.
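For the referential integrity scenario above, an orphan check might look like the following sketch; the fact and dimension names (dw_sales_fact, dw_product_dim, product_key) are illustrative:

    -- Orphan check: fact rows whose foreign key has no matching parent row.
    SELECT f.sale_id, f.product_key
    FROM   dw_sales_fact f
    LEFT JOIN dw_product_dim d
           ON d.product_key = f.product_key
    WHERE  d.product_key IS NULL;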
Data Quality
For the purposes of this discussion, data quality
is defined as "how the ETL system handles data rejection, substitution,
correction and notification without modifying data." To ensure success in
testing data quality, include as many data scenarios as possible. Typically,
data quality rules are defined during design, for example:
- Reject the record if a certain decimal field has nonnumeric data (see the sketch after this list).
- Substitute null if a certain decimal field
has nonnumeric data.
- Validate and correct the state field if
necessary based on the ZIP code.
- Compare product code to values in a lookup table, and if there is no match, load the record anyway but report it to users.
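As a sketch, the first rule above might be verified with a query that isolates the records expected to be rejected. The table and column names (stg_sales, amount_txt) are placeholders, and TRY_CAST is SQL Server syntax; other databases offer similar safe-conversion functions:

    -- Rows whose decimal field holds nonnumeric data; these should land in
    -- the reject file, not in the target table.
    SELECT *
    FROM   stg_sales
    WHERE  amount_txt IS NOT NULL
      AND  TRY_CAST(amount_txt AS DECIMAL(10,2)) IS NULL;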
Depending on the data quality rules of the
application being tested, scenarios to test might include null key values,
duplicate records in source data and invalid data types in fields (e.g.,
alphabetic characters in a decimal field). Review the detailed test scenarios
with business users and technical designers to ensure that all are on the same
page. Data quality rules applied to the data will usually be invisible to the
users once the application is in production; users will only see what's loaded
to the database. For this reason, it is important to ensure that what is done
with invalid data is reported to the users. These data quality reports present
valuable data that sometimes reveals systematic issues with source data. In
some cases, it may be beneficial to populate the "before" data in the
database for users to view.
Performance and Scalability
As the volume of data in a data warehouse grows,
ETL load times can be expected to increase, and performance of queries can be
expected to degrade. This can be mitigated by having a solid technical
architecture and good ETL design. The aim of performance testing is to
point out any potential weaknesses in the ETL design, such as reading a file
multiple times or creating unnecessary intermediate files. The following
strategies will help discover performance issues:
- Load the database with peak expected
production volumes to ensure that this volume of data can be loaded by the
ETL process within the agreed-upon window.
- Compare these ETL loading times to loads
performed with a smaller amount of data to anticipate scalability issues.
Compare the ETL processing times component by component to point out any
areas of weakness.
- Monitor the timing of the reject process and
consider how large volumes of rejected data will be handled.
- Perform simple and multiple join queries to validate query performance on large database volumes, as sketched below. Work with business users to develop sample queries and acceptable performance criteria for each query.
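A sample multiple-join query for this purpose might look like the sketch below; all table and column names are illustrative, and acceptable run times should be agreed with business users:

    -- Multiple-join aggregation to be timed against peak production volumes.
    SELECT c.region, p.category, SUM(f.order_total) AS total_sales
    FROM   dw_sales_fact f
    JOIN   dw_customer_dim c ON c.customer_key = f.customer_key
    JOIN   dw_product_dim  p ON p.product_key  = f.product_key
    GROUP  BY c.region, p.category;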
Integration Testing
Typically, system testing only includes testing
within the ETL application. The endpoints for system testing are the input and
output of the ETL code being tested. Integration testing shows how the
application fits into the overall flow of all upstream and downstream
applications. When creating integration test scenarios, consider how the
overall process can break and focus on touch points between applications rather
than within one application. Consider how process failures at each step would
be handled and how data would be recovered or deleted if necessary.
Most issues found during integration testing are either data related or result from false assumptions about the design of another application. Therefore, it is important to run integration tests with production-like data. Real production data is ideal, but depending on the
contents of the data, there could be privacy or security concerns that require
certain fields to be randomized before using it in a test environment. As
always, don't forget the importance of good communication between the testing
and design teams of all systems involved. To help bridge this communication
gap, gather team members from all systems together to formulate test scenarios
and discuss what could go wrong in production. Run the overall process from end
to end in the same order and with the same dependencies as in production. Integration
testing should be a combined effort and not the responsibility solely of the
team testing the ETL application.
User-Acceptance Testing
The main reason for building a data warehouse
application is to make data available to business users. Users
know the data best, and their participation in the testing effort is a key
component to the success of a data warehouse implementation. User-acceptance
testing (UAT) typically focuses on data loaded to the data warehouse and any
views that have been created on top of the tables, not the mechanics of how the
ETL application works. Consider the following strategies:
- Use data that is either from production or
as near to production data as possible. Users typically find issues once
they see the "real" data, sometimes leading to design changes.
- Test database views by comparing view contents to what is expected (a query sketch follows this list). It is important that users sign off on and clearly understand how the views are created.
- Plan for the system test team to support
users during UAT. The users will likely have questions about how the data
is populated and need to understand details of how the ETL works.
- Consider how the users need the data loaded during UAT, and negotiate how often the data will be refreshed.
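A simple way to test a view, as suggested above, is to recompute its expected contents from the base tables and take the set difference in both directions. The view and table names (v_active_customer, dw_customer) are placeholders:

    -- Empty results in both directions mean the view matches expectations.
    SELECT customer_key FROM v_active_customer
    EXCEPT
    SELECT customer_key FROM dw_customer WHERE status = 'ACTIVE';

    SELECT customer_key FROM dw_customer WHERE status = 'ACTIVE'
    EXCEPT
    SELECT customer_key FROM v_active_customer;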
Regression Testing
Regression testing is the revalidation of existing
functionality with each new release of code. When building test cases, remember
that they will likely be executed multiple times as new releases are created
due to defect fixes, enhancements or upstream systems changes. Building
automation during system testing will make the process of regression testing
much smoother. Test cases should be prioritized by risk in order to help
determine which need to be rerun for each new release. A simple but effective
and efficient strategy to retest basic functionality is to store source data
sets and results from successful runs of the code and compare new test results
with previous runs. When doing a regression test, it is much quicker to compare
results to a previous execution than to do an entire data validation again.
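As a sketch of that compare-to-previous-run strategy, results stored from a successful baseline run can be diffed against the current load; dw_customer and baseline_customer are placeholder names:

    -- Rows added or changed since the baseline run:
    SELECT * FROM dw_customer
    EXCEPT
    SELECT * FROM baseline_customer;

    -- Rows missing or changed since the baseline run:
    SELECT * FROM baseline_customer
    EXCEPT
    SELECT * FROM dw_customer;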
Taking these considerations into account during
the design and testing portions of building a data warehouse will ensure that a
quality product is produced and prevent costly mistakes from being discovered
in production.