The United States Postal Service and international databases are constantly changing, and one of the problems that all postal software vendors must face is providing their customers with up-to-date data. To do so, most vendors ship CDs and DVDs to their customers monthly or bi-monthly.
Patching is a new approach to the daunting task of database management, based on tools that have been available to software manufacturers for years. Patching is a method by which the latest database is compared to the last released version of the same database and only the new or changed data is extracted.
For example, consider one ZIP Code database that is over 500MB and is updated monthly by the USPS. Patching tools are able to compare the August version to the July version and generate a 2MB "difference file" that accurately represents only the differences between the two versions.
The obvious benefit of this approach is that any customer who was sent the July database can be fully updated to the complete August version by sending only 2MB rather than 500MB. The patch software provides an "Apply" piece that integrates the changes and creates a database identical to the August version, down to the byte. This enables Internet delivery in place of costly and time-consuming CDs. A 2MB patch could even be e-mailed.
Larger files, such as a combination of mailing and map data, can reach two or three gigabytes and beyond, and the sheer volume compounds the burden of providing updated data. Patching tools are able to shrink these multi-gigabyte data files to mere megabytes, enabling faster delivery of updates to the customers who rely on accurate data for efficient operation.
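The compare-and-apply cycle described above can be illustrated with a minimal Python sketch. It is not how any commercial difference engine works: it assumes Python's standard difflib module as a stand-in comparison algorithm (practical only for small inputs), and the names make_delta, apply_delta and payload_size, the copy/data operation format and the sample records are all hypothetical.

```python
import difflib


def make_delta(old: bytes, new: bytes) -> list:
    """Return operations that rebuild `new` from `old`.

    ("copy", start, end) means "reuse old[start:end]";
    ("data", payload) means "append these new or changed bytes".
    This operation format is invented for illustration only.
    """
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("data", new[j1:j2]))
        # "delete" needs no operation: those old bytes are simply not copied.
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """The "Apply" piece: rebuild the new version from the old one."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def payload_size(delta: list) -> int:
    """Count only the literal bytes that would have to be shipped."""
    return sum(len(op[1]) for op in delta if op[0] == "data")


# Hypothetical "July" and "August" files: 100 records, one of them changed.
july = b"".join(b"%05d MAIN ST\n" % n for n in range(100))
august = july.replace(b"00042 MAIN ST", b"00042 ELM AVE")

delta = make_delta(july, august)
assert apply_delta(july, delta) == august  # byte-for-byte identical
print("full file:", len(august), "bytes; changed bytes shipped:", payload_size(delta))
```

A real difference file would also need a compact binary encoding for the copy instructions, and a tool aimed at multi-gigabyte files would need a far more scalable comparison algorithm than the one assumed here.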
History of Patching
Patching is a craft that dates back to pre-PC days on the mainframe. Back then, mainframe programmers would overlay existing instructions with new instructions to fix bugs. When necessary, the developer would branch out of the old instruction sequence in order to fit additional operations into a limited space, literally "patching" the system.
The problem with this approach was that it required the time of highly skilled developers, was prone to error and simply was not generalized. From these early days of "patching" sprang tools that were able to automatically identify new or changed instructions and generate a "patch file" that would fix buggy code and add new functionality. These tools automated the task of creating bandwidth-efficient updates for applications and small data files.
Today, "patch" is often used as a generic term meaning "a software fix," not necessarily one that uses the bandwidth-reduction technique of identifying only new and changed information. To differentiate, those familiar with the art refer to true patching as byte-level differencing. Byte-level differencing specifically refers to the process of comparing two versions of a file and extracting only the changed and new information.
Is It Safe?
The idea of transmitting a difference file in place of full versions will give most database administrators an unnerving image of Star Trek's transporters. If just one byte gets put together wrong, you might find your nose where your ear is supposed to be.
Commercial byte-level difference engines have been around for over a decade and have been used for mission-critical applications, from desktop anti-virus program updates to updates for aerospace, military and the Department of Defense.
Modern byte-level difference utilities contain multiple safeguards, including pre-deployment file verification and multiple checksum verifications during the application of the difference file, to ensure that every byte is updated properly. Byte-level differencing technology is used every day on tens of millions of PCs, workstations and servers. The algorithms used for these "software" updates were fine-tuned for executables and small data files. Using byte-level differencing on a gigabyte-sized database was previously not possible because of limitations in the available tools.
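The safeguards mentioned above, verifying the installed file before patching and checking the result afterwards, can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 is chosen here as the checksum (commercial tools may use something else), the patch layout and the names build_patch and safe_apply are hypothetical, and the differencing itself is a toy that keeps only a common prefix and suffix.

```python
import hashlib


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_patch(old: bytes, new: bytes) -> dict:
    """Toy difference file: shared prefix/suffix lengths plus the changed middle.

    The checksums of both versions travel with the patch (an assumed layout).
    """
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return {
        "old_sha256": sha256(old),
        "new_sha256": sha256(new),
        "prefix": prefix,
        "suffix": suffix,
        "middle": new[prefix:len(new) - suffix],
    }


def safe_apply(old: bytes, patch: dict) -> bytes:
    """Apply the patch only if the input and the output both verify."""
    # Pre-deployment check: is this the version the patch was built against?
    if sha256(old) != patch["old_sha256"]:
        raise ValueError("installed file is not the version this patch expects")
    tail = old[len(old) - patch["suffix"]:]
    result = old[:patch["prefix"]] + patch["middle"] + tail
    # Post-apply check: is the rebuilt file exactly the intended new version?
    if sha256(result) != patch["new_sha256"]:
        raise ValueError("patched file failed verification")
    return result


july = b"ZIP 12345 | 10 MAIN ST | JULY DATA\n"
august = b"ZIP 12345 | 10 MAIN ST | AUGUST DATA\n"
patch = build_patch(july, august)
assert safe_apply(july, patch) == august

try:
    safe_apply(b"ZIP 12345 | 10 MAIN ST | JUNE DATA\n", patch)
except ValueError as err:
    print("refused:", err)
```

Refusing to run against an unexpected input means a customer who skipped a release, or whose file was damaged, is told so up front instead of ending up with a silently corrupted database.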
Advancements in the State of the Craft
Recent advancements in the industry have broken a long-held barrier that made it difficult, if not impossible, to create byte-level difference files for very large databases. Today, distributors of huge databases such as the Delivery Point Validation (DPV) data from the USPS (over 145 million addresses) and the Postcode Address File (PAF) from Royal Mail (over 27 million addresses) are realizing that the same approach that has been used for years to update their software can now be used on their data as well.
These new tools are able to process a 1.5GB database in minutes, shrinking the size of the update by more than 90% in the common case. Since the process of identifying the changes is completely automated by the byte-level difference engine, the move to providing updates via the Internet can be made quickly and can serve as a replacement for, or a supplement to, existing update mechanisms (e.g., CD-based updates).
Given that the cost of undeliverable-as-addressed mail has now been reported by the MITF as $1.5 billion annually, the business necessity of maintaining data is clear. Byte-level differencing software provides a new but proven way to ease the burden of providing bandwidth-efficient updates, faster and more cheaply than conventional methods.
Kerry Jones, Ph.D. is CTO of Pocket Soft, Inc., where the industry's first commercial byte-level differencing software was released in 1992. For more information, contact Pocket Soft at 800-826-8086 or visit www.pocketsoft.com.