New Book Review: "The Practitioner's Guide to Data Quality Improvement"

New book review for The Practitioner's Guide to Data Quality Improvement, by David Loshin, Morgan Kaufmann, 2010, reposted here:


Rating: 4 out of 5 stars

Some of the other reviews posted here offer interesting observations from perspectives that are not always centered on data architecture or general enterprise architecture, and this reviewer hopes to offer feedback on this text based on his consulting experience in these areas. In his preface, David Loshin comments that "this book is intended to provide the fundamentals for developing the enterprise data quality program, and is intended to guide both the manager and the practitioner in establishing operational data quality control throughout an organization, with particular focus on the ability to build a business case for instituting a data quality program", "the assessment of levels of data quality maturity", "the guidelines and techniques for evaluating data quality and identifying metrics related to the achievement of business objectives", "the techniques for measuring, reporting, and taking action based on these metrics", and "the policies and processes used in exploiting data quality tools and technologies for data quality improvement".

With these goals in mind, this reviewer thinks Loshin succeeded. Data quality is an enormous practice area in which success requires an understanding of both the data and the business, and this introductory text walks the reader step by step through a considerable number of topics over which many authors would likely stumble. Some of the explanations that Loshin provides, such as the one in the chapter entitled "Developing a Business Case and a Data Quality Road Map" on how data flaws can incur business impacts, are extremely well done, especially when married with effective diagrams. And in his chapter entitled "Metrics and Performance Improvement", the author provides an explanation of drilling through key performance indicators that this reviewer has not seen elsewhere, and the presentation is exceedingly well done. Other areas of this text that this reviewer especially appreciates are the chapters entitled "Data Requirements Analysis", "Metadata and Data Standards", and "Inspection, Monitoring, Auditing, and Tracking".

This reviewer would, however, like to make potential readers of this book aware that most of what Loshin provides here consists of high-level walkthroughs and examples of pertinent elements within data quality, rather than practical advice on how to approach much of the lower-level work that should be expected to take place on a day-to-day basis. For example, in the chapter entitled "Entity Identity Resolution", the author provides a section on matching algorithms that briefly discusses parsing and standardization, abbreviation expansion, edit distance, phonetic comparison, and n-gramming, all in just a few short paragraphs. The author does not explain that many more matching algorithms are currently in use in industry, that most matching exercises need to combine multiple algorithms rather than rely on a single algorithm in isolation, that internationalization plays an ever more important role in matching today, and that a wide variety of commercial tooling is available and needs to be assessed against the needs of the organization.
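To make the point about combining algorithms concrete, here is a minimal Python sketch of this reviewer's own devising; it is not taken from Loshin's text or from any commercial product. It blends two of the techniques named above, edit distance and n-gramming (bigrams scored with a Dice coefficient), after a crude standardization step. The function names, the equal weights, and the normalization rules are all assumptions chosen purely for illustration.

```python
def normalize(s: str) -> str:
    """Crude stand-in for standardization: lowercase, drop punctuation, collapse spaces."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in s.lower())
    return " ".join(cleaned.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to the range 0..1, where 1 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def bigrams(s: str) -> set:
    """Set of adjacent character pairs in the string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def ngram_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigram sets."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        # Strings too short to form bigrams: fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def combined_score(a: str, b: str, w_edit: float = 0.5, w_ngram: float = 0.5) -> float:
    """Weighted blend of both measures; the weights are hypothetical, not tuned."""
    na, nb = normalize(a), normalize(b)
    return w_edit * edit_similarity(na, nb) + w_ngram * ngram_similarity(na, nb)


if __name__ == "__main__":
    for left, right in [("Jon Smith", "John Smith"), ("Jon Smith", "Mary Jones")]:
        print(left, "|", right, "|", round(combined_score(left, right), 3))
```

In practice the weights and the match threshold would have to be tuned against representative data, and names containing non-Latin characters would need far more careful normalization than this sketch attempts.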

Armed with this knowledge, however, the reader is sure to find this work useful when planning and strategizing data quality, and as introductory material for understanding what it might take to pursue efforts that require a higher level of data quality maturity, such as master data management (MDM), for which this reviewer recommends "Enterprise Master Data Management: An SOA Approach to Managing Core Information" by Allen Dreibelbis, Eberhard Hechler, Ivan Milman, Martin Oberhofer, Paul van Run, and Dan Wolfson (see my review). In the opinion of this reviewer, what Loshin provides here is best suited for managers looking to piece together all of the steps associated with data quality pursuits, to get a better handle on how the steps are interrelated and whether each is a requirement or just an option, and possibly to solve some aspects of data quality in an evolutionary, piecemeal fashion rather than as a revolutionary endeavor.


Comments from original Typepad blog:


David Loshin:
This is the same comment I left at Amazon, but since you had a copy of the review here I figured I'd add my comment here as well:

Thanks for the nice review, I appreciate your thoughtful comments. Clearly you have done a good job of summarizing some main points, and I am glad that you spent the time to read and then share your thoughts.

I do want to note, though, that for the most part the algorithms employed by most matching products are either deterministic based on a set of rules, or are probabilistic and are rooted in the Fellegi-Sunter approach from the late 1960s (with some notable exceptions, although those approaches might be beyond the scope of the audience). In addition, most practitioners are less likely to be developing matching tools but will use existing tools, and my goal was to make them generally aware of the underlying approaches so that they can distinguish between the alternatives.

Also, I continue to add content to augment the material in the book at an accompanying web site http://dataqualitybook.com, and I invite you to visit and post your thoughts there as well. Thanks!
Wednesday, March 23, 2011 at 7:40 PM
