DLSuperC ver 7.3a - 2/25/05 (473 kb)
DLSuperCX ver 1.8 - 05/18/04 (438 kb)
DLSuperCBF ver 3.2 - 05/18/04 (366 kb)
DLSuperCTW ver 2.4b - 02/26/05 (605 kb)
DLSuperCBT ver 2.2 - 05/18/04 (339 kb)
DLSuperCRV ver 1.4 - 05/22/02 (331 kb)
Introduction
DLSuperC is a compare program that works differently from most compare programs, yet it
shares many of their external display features. When you run DLSuperC, you expect
it to detect differences and produce output that shows where they were found. A
deficiency of some other programs is that they sometimes get lost when producing results,
even showing differences in areas that have not been changed. Many programs are also limited in their
capacity to handle large files, whereas DLSuperC can handle huge files by using a unique
iterative process that dynamically partitions the file into sections and then combines the individual partial
results into one overall set of statistics. Few compare programs feature filters whereby the
input files can be subjected to a set of text definitions to suppress data that might not be of
interest in detecting changes at any particular time. Comment lines, predefined lines of unimportant
text, and sequence number columns in the input data stream are examples of data that can be
left out when preparing the compare set.
Some older compare programs used sequence numbers to detect where changes had been
made. Of course, most newer PC program files don't normally have sequence numbers, so this is an
outmoded consideration. Some newer programs scan both files to be compared and use different
criteria for finding change intervals. These programs mostly take the consecutive text sequences
from the original files and then try to find where they differ by running an assortment of iterative
algorithms over these text strings. This can consume a lot of I/O accesses (because the complete
input files cannot be held in real storage), especially if the
compared files are large and have long lines. Other programs even try to sort lines and then use
some tagging notation to determine matches, inserts, and deletions.
Key distinguishing capabilities featured in DLSuperC
The following discussion assumes
that the two input files are text files composed of a number of lines. One file is termed the old file
while the other is the new file. They might also be termed the original (old) file and the
modified or updated (new) file. Such simplifications are used only in the explanation, as the files
could be reversed; the only change would be that the insert and delete references would also be
reversed. In fact, neither file needs to be a text file.
DLSuperC
An overall simplified description of the matching process is that
DLSuperC
The LMCS sets, using lines from the files in the explanation, could be as simple
as runs of consecutive matching lines from each file, which is the
usual definition of an LMCS set.
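
To make the "consecutive matching lines" idea concrete, here is a minimal sketch that finds the longest run of consecutive equal lines between two files. It uses plain quadratic dynamic programming, which is an assumption made for illustration and is not DLSuperC's own method; the function name longest_matching_run is hypothetical.

```python
def longest_matching_run(old, new):
    """Return (old_start, new_start, length) of the longest run of
    consecutive lines that are equal in both lists (first found on ties)."""
    best = (0, 0, 0)
    # prev[j] = length of the matching run ending at old[i-2], new[j-1]
    prev = [0] * (len(new) + 1)
    for i in range(1, len(old) + 1):
        cur = [0] * (len(new) + 1)
        for j in range(1, len(new) + 1):
            if old[i - 1] == new[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


old_lines = ["a", "b", "c", "d", "e"]
new_lines = ["x", "b", "c", "d", "y"]
run = longest_matching_run(old_lines, new_lines)
```

Here the run starts at index 1 in both files and is three lines long. Lines outside the selected run would then be examined for further, shorter runs.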
DLSuperC further expands the LMCS set
by initially preprocessing lines to remove all embedded blank
characters. No changes are made to the user input files; only the internal buffers are modified.
Blank filtering allows the matching set
to include reformatted lines, increasing the size of an existing
LMCS set. This maximizes the probability of
correctly locating the match sets a user expects, which might otherwise be overlooked because
reformatted lines count as changed and shorten the LMCS.
It also creates longer match sets, making them
more likely to be selected than other matching sets that could distort the final boundary setting.
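
A minimal sketch of blank filtering follows. It assumes that "blank" means the space and tab characters, which the text does not specify, and the helper name strip_blanks is hypothetical. The comparison works on copies, so the input lines are left untouched.

```python
def strip_blanks(line):
    """Return a copy of the line with all space and tab characters removed."""
    return line.replace(" ", "").replace("\t", "")


old_lines = ["if (a == b)", "    x = 1;", "end;"]
new_lines = ["if (a==b)", "x = 1;", "end;"]

# Compare working copies only; the original lists are never modified.
old_keys = [strip_blanks(s) for s in old_lines]
new_keys = [strip_blanks(s) for s in new_lines]
matches = [o == n for o, n in zip(old_keys, new_keys)]
```

All three pairs match once blanks are removed, so the reformatted lines no longer split what would otherwise be a single three-line match set.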
The LMCS set can be further affected by the processing due to the
optional exclusion capability in
DLSuperC
As each line is processed, it is compressed into a fixed 32-bit hash value
that temporarily represents the content of the original input line. Most of the
comparison process uses this hash value as a substitute for the input line. The hash value
does not ensure that each line has a unique character content value. It is a simplification
that speeds up the comparison process.
Comparing a fixed-size value is faster than comparing variable-length text strings to find equal lines.
Equal hash values alone can never be used to conclude that lines are equal, since their data content might differ.
However, unequal hash values guarantee that the lines are not equal.
Processing must ensure that lines with equal hash values and the same data content can be easily selected.
This is implemented with a chaining mechanism whereby lines with the same data content are chained
together. Of course, all lines on such a chain also have equal hash values.
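
The asymmetry between equal and unequal hash values can be sketched as follows. The text does not name the hash function DLSuperC uses, so zlib.crc32 stands in here purely as an assumption; line_hash and lines_equal are hypothetical names.

```python
import zlib


def line_hash(line):
    """32-bit hash of a line; CRC-32 is a stand-in, not DLSuperC's hash."""
    return zlib.crc32(line.encode("utf-8")) & 0xFFFFFFFF


def lines_equal(a, b, hash_a, hash_b):
    """Decide line equality using the hashes as a fast first check."""
    # Unequal hashes prove the lines differ, so the text compare is skipped.
    if hash_a != hash_b:
        return False
    # Equal hashes are only a hint; the content must still be compared.
    return a == b


h = line_hash("begin;")
same = lines_equal("begin;", "begin;", h, h)
```

The second branch is what protects against two different lines that happen to share a hash value.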
Blank suppression, line exclusion, column selection, and case normalization are applied prior to the hashsum
processing. Their effect must be reflected in the overall results; this can be done after the comparison process
is completed but before the results are displayed and statistics are gathered.
However, it should be pointed out that the computed hashsums are the exclusive basis for the match determination,
as no dynamic substitution or content processing exits are invoked while the matching process is being done.
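
One way to picture these filters is as a single function that turns an input line into the key that is later hashed. This is only a sketch: the order in which the filters run, the parameter names, and the prefix-based exclusion rule are all assumptions, not DLSuperC's actual options.

```python
def make_key(line, columns=None, ignore_case=False, excluded=()):
    """Build the comparison key for a line, or None if the line is excluded.
    All parameters are hypothetical illustrations."""
    if columns is not None:
        start, end = columns              # 0-based, end exclusive
        line = line[start:end]            # column selection
    if any(line.lstrip().startswith(p) for p in excluded):
        return None                       # line exclusion
    if ignore_case:
        line = line.upper()               # case normalization
    return "".join(line.split())          # blank suppression


key = make_key("000100 Begin ;", columns=(7, 80), ignore_case=True)
skipped = make_key("* a comment line", excluded=("*",))
```

Only the key is hashed and compared; the original line is kept aside so that it can still be shown in the results.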
Old file lines with equal hash values are chained within an old file data structure using the hash array table
value as the reference. Because the lines are processed in reverse input order, each array entry represents the
most recently entered line with that hash value. As each new entry
is entered, a backward chain develops pointing to any previous equal hash entry.
The new file line structure uses the final state of the array so that its entries
point to the first occurring old file line with the same hash value. The new file topmost
reference is dynamically updated during the matching process as match spans eliminate structure
elements that no longer need to be referenced. Nevertheless, the inspection process
can use the initial chain structure to determine how to progress through the span matching process, since the
links have already been set up, eliminating time-consuming searches to determine the next eligible match.
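
The chain construction can be sketched as below. A Python dict stands in for the hash array table, which is an assumption, and the names build_old_chains and link_new_lines are hypothetical. Walking the old file backwards leaves the table referring to the earliest line for each hash value, with each line linked to the next later line that shares it.

```python
def build_old_chains(old_hashes):
    """Return (head, next_same): head maps a hash to the first old line index
    with that hash; next_same[i] is the next old line with the same hash."""
    head = {}
    next_same = [None] * len(old_hashes)
    # Process in reverse input order so the table ends at the earliest line.
    for i in range(len(old_hashes) - 1, -1, -1):
        h = old_hashes[i]
        next_same[i] = head.get(h)   # link to the previously entered line
        head[h] = i                  # the table now refers to this line
    return head, next_same


def link_new_lines(new_hashes, head):
    """For each new line, the first old line with an equal hash, or None."""
    return [head.get(h) for h in new_hashes]


head, next_same = build_old_chains([7, 9, 7, 7])
new_refs = link_new_lines([9, 7, 5], head)
```

Following next_same from a new line's reference visits every old line with the same hash in file order, without any searching.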
Since equal hash values may represent text lines
whose data content is not equal, these false matches (as promised) must be unchained and rechained
into separate chains that have both the same hash and the same data content. Now the LMCS process can be
started, as all matching lines have been
identified and appear on correct chains. The task that is left is to determine where the best
matches are located and where the mismatches in the leftover lines exist, since these are
the inserted and deleted lines.
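
Rechaining can be sketched as splitting one hash chain into sub-chains by actual line content. The chain is represented here as a plain list of line indexes, which is an assumption; split_chain_by_content is a hypothetical name.

```python
from collections import defaultdict


def split_chain_by_content(chain, lines):
    """Split one hash chain (a list of line indexes) into chains whose
    lines have identical content, keeping first-seen order."""
    by_content = defaultdict(list)
    for idx in chain:
        by_content[lines[idx]].append(idx)
    return list(by_content.values())


lines = ["begin;", "x := 1;", "begin;", "end;"]
# Pretend lines 0, 1 and 2 all produced the same 32-bit hash value.
chains = split_chain_by_content([0, 1, 2], lines)
```

Line 1 ends up on a chain of its own, so it can no longer be mistaken for a match with the two "begin;" lines.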
There are many duplicate matching lines and small matching sequences within modern-day source files.
For example, there are a large number of identical "begin;" and "end;" full statement lines (followed by or preceded by
one or two generic lines)
appearing throughout a typical Pascal program. These repetitive statement sequences can be termed "onezs",
"twozs" and "threezs" matching candidates. Top, middle, and end matches exist independently of their
fit in the overall contextual matching of the source input. The best procedure is to try, initially, to
eliminate as many of them as possible from the candidate matching sets, since many will be eliminated in the
final results anyway. The
DLSuperC
There is some optimizing code that advances
each new match span inspection to the next start point. It uses the last matched span length to eliminate
redundant examination of all the sets contained within the last determined match span. Using the length as an
increment to start the next set of candidates, the process
double checks whether the last advancement increment might have missed some eligible back-match candidates.
It is very tricky but seems to work, saving a lot of non-productive processing.
Another selection optimization checks for equal-length LMCS candidates to determine which
one is best to select. Some analysis ensues
to determine, conditionally, which candidate is the least destructive in setting its new match boundaries.
It is also possible for some compare sets to be
extended across discontinuities of 1 to 3 unmatched lines. This could make such a set a
better candidate than others that cannot absorb any discontinuity. The discontinuity need not be the same
size in each file. A simple example is a pair of files with
one insert in the new file and two deletes in the old file, which might occur when two lines
are replaced by a single line in an update. Instead of finding two match sets, the initial
processing scan would find a single eligible set that absorbs the change discontinuity. The discontinuity
condition could exist in any real case; absorbing it not only reduces the number of eligible compare sets to be processed but
also enhances the proper selection of the real-world matches within the changed source.
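
The two-lines-replaced-by-one example can be sketched as follows. This is a simplified illustration under stated assumptions: the limit of 3 unmatched lines is applied to each file separately, the nearest resynchronization point is always taken, and the function name extend_across_gaps is hypothetical. DLSuperC's actual rules are not given in the text.

```python
MAX_GAP = 3  # largest discontinuity absorbed in either file (assumed per file)


def extend_across_gaps(old, new, i, j):
    """Starting where old[i] == new[j], extend a match set forward, absorbing
    gaps of up to MAX_GAP unmatched lines in either file.
    Returns (old_end, new_end, matched_count) with exclusive ends."""
    matched = 0
    while True:
        # Consume the run of consecutive equal lines.
        while i < len(old) and j < len(new) and old[i] == new[j]:
            i += 1
            j += 1
            matched += 1
        end_i, end_j = i, j
        # Search for the nearest point where the files line up again.
        resync = None
        for total in range(1, 2 * MAX_GAP + 1):
            for di in range(max(0, total - MAX_GAP), min(MAX_GAP, total) + 1):
                dj = total - di
                if (i + di < len(old) and j + dj < len(new)
                        and old[i + di] == new[j + dj]):
                    resync = (i + di, j + dj)
                    break
            if resync is not None:
                break
        if resync is None:
            return end_i, end_j, matched
        i, j = resync


old_lines = ["a", "b", "c1", "c2", "d", "e"]
new_lines = ["a", "b", "c", "d", "e"]
result = extend_across_gaps(old_lines, new_lines, 0, 0)
```

With two old lines replaced by one new line, the scan yields one set covering both files with four matched lines, rather than two separate two-line sets.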
Since blanks are always suppressed from lines during initial preprocessing, all match sets
must be revalidated for matches that could be actual reformats. This is done in a final validation pass. An
exception when validating reformatted lines is where only the number of blanks at the end of a line differs,
as some editors truncate trailing blanks whereas others do not.
It is unreasonable to expect any user looking at a report listing to recognize whether blank differences
exist at the end of a line or whether the line's last visible character is the end of the line, since CR/LF characters
are never indicated in a displayed line.
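
The validation rule can be sketched as a small classifier applied to each pair of lines that already matched with blanks removed. It assumes the lines are passed in without their CR/LF terminators; the function name classify_match and the labels "match" and "reformat" are hypothetical.

```python
def classify_match(old_line, new_line):
    """Classify a pair of lines that already matched with blanks removed.
    Returns "match" or "reformat"."""
    if old_line == new_line:
        return "match"
    # Trailing blanks cannot be seen in a listing, so they are not reported.
    if old_line.rstrip(" ") == new_line.rstrip(" "):
        return "match"
    return "reformat"


trailing = classify_match("x = 1;", "x = 1;   ")
respaced = classify_match("x = 1;", "x=1;")
```

Only the second pair is reported as a reformat, since its blank difference is visible inside the line.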
DLSuperC
The basic LMCS algorithm in
DLSuperC
In
DLSuperCRV
DLSuperCTW
DLSuperCBT
The LMCS parsing algorithm is common with
all the versions of the
DLSuperC
The iterative processing technology was first utilized in
DLSuperC
Content matching processing can be done with
DLSuperC
A
user with a sequential database file might need the unordered compare option, since
additions and deletions to the file are mostly random operations. The file may never be sorted.
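
The idea of an unordered compare can be sketched by treating each file as a multiset of lines, so that position plays no part in the result. This only illustrates the concept under that assumption; it does not describe how DLSuperC implements the option, and unordered_compare is a hypothetical name.

```python
from collections import Counter


def unordered_compare(old, new):
    """Compare two files as multisets of lines, ignoring line order.
    Returns (deleted, inserted) as Counters mapping line -> count."""
    old_counts, new_counts = Counter(old), Counter(new)
    return old_counts - new_counts, new_counts - old_counts


old_records = ["r1", "r2", "r3", "r2"]
new_records = ["r3", "r1", "r4", "r2"]
deleted, inserted = unordered_compare(old_records, new_records)
```

Here one copy of "r2" is reported as deleted and "r4" as inserted, even though the remaining records appear in a different order.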