Module notzed.dez
Package au.notzed.dez

Interface DEZFormat

  • All Known Implementing Classes:
    DEZDecoder, DEZEncoder, DEZPrinter

    public interface DEZFormat
    Defines constants used in the DEZ1 format.

    Data Types

    The patch format is a byte stream consisting of a header and a sequence of bytes, followed by a crc code.

      Ciiiiiiii   integer    A multi-byte integer.  C is a continue bit.  The
                             bits are in big-endian order.
      aaaaaaaaa   address    An encoded address.  See below.
      ddddddddd   data       A data byte.
      XXXX        4-bytes    4 sequential bytes.
     
    A trailing (*) indicates zero or more instances. A trailing (+) indicates a self-describing encoding of one or more bytes. A trailing (?) indicates an implied encoding of one or more bytes.

    Patch Format

     HEADER:
      XXXX        magic      4 bytes ASCII, 'DEZ1'.
      Ciiiiiii+   smallest   Smallest copy offset.
      Ciiiiiii+   split      Split point for single operation opcode.
      Ciiiiiii+   source     Source data size.
      Ciiiiiii+   target     Target data size.
    
     DATA:
      IIIIIIII*   commands   Zero or more instruction opcodes and operands
    
     CRC:
      XXXX        crc        4-byte CRC of target bytes, network order.
     

    Instruction stream

    Instructions can perform either a single operation or two operations with limited ranges.

    Single commands perform one operation. They are a 7-bit value with the 8th bit set. A range test defines the operation.

    
     1nnnnnnn:  assuming a default split=100
    
     [000 099] aaaaaaaa+           COPY Copy (n+smallest) from the address.
                                          smallest .. smallest+99
     [100 123] dddddddd?           ADD  Append (n-99) from the patch.
                                                 1 .. 24
     [    124] Ciiiiiii+ aaaaaaaa+ COPY Copy (i+100+smallest) from the address.
     [    125] Ciiiiiii+ dddddddd? ADD  Append (i+(124-100)+1) from the patch.
     [    126] Ciiiiiii+ dddddddd  RUN  Append (i+3) copies of the data byte.
     [    127]                          Reserved.
     
    The split point between an immediate-length COPY and ADD is stored in the stream. The default OP_SINGLE_SPLIT is 100.

    Dual command perform two operations with a single opcode. The lengths are encoded as immediate values and have a limited range of 3 bits each.

    
     00lllccc dddddddd? aaaaaaaa+   ADD+COPY
    
                Append the next (lll+1) data bytes (dddddddd?) from the patch.
                Then copy (ccc+smallest) bytes from the address (aaaaaaaa+).
    
     01cccCCC aaaaaaaa+ AAAAAAAA+   COPY+COPY
    
                Copy (ccc+smallest) bytes from the first address (aaaaaaaa+).
                Then copy (CCC+smallest) bytes from the second address (AAAAAAAA+).
     

    Addresses

    Addresses are a logical pointer which refers to a location within a buffer consisting of the concatenated source and target buffers. The encoder only referneces addresses that have already been seen so they will also be visible to the decoder.

    A small address cache is maintained in both the encoder and decoder to the last N recently accessed addresses. This can be used to encode the address with fewer bytes.

    Addresses can be encoded in three ways.

    1. An exact reference to an address cache entry.
    2. A reference to an address cache entry with a signed offset.
    3. An absolute address.
    The first byte determines the address encoding.
    
      00nnnnnn           - use address 'n' exactly.
      01Snnnnn Ciiiiiii+ - use address 'n' plus (S=0) or minus (S=1) 'i'
      1iiiiiii Ciiiiiii+ - use address 'i' absolute.  Always at least 2 bytes.
     

    The address table is 64 elements long but the relative instruction can only reference the most recent 32 elements. It is updated each time an address is encoded or decoded in a simple rolling manner.

    Conceptually it can be broken into two separate parts:

       matchAddr[(matchNext++) & 63] = addr;
       recentAddr[(recentNext++) & 31] = addr;
     
    First the encoder searches the match table; if it finds a match the address is encoded implicitly in the opcode via an index.

    Next the encoder finds the nearest match from the recent table as absolute distance. The encoder will then encode the address as a relative offset from a member of the table if the total encoded length is shorter than encoding the address absolutely.

    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int ADDR_ABSOLUTE
      Indicates whether this is an absolute or indirect address.
      static int ADDR_RELATIVE
      Indicates whether the address is relative or exact indirect.
      static int ADDR_SIGN
      The sign of the offset for a relative indirect address.
      static byte[] MAGIC
      File magic 'DEZ1'.
      static int OP_ADD
      Extended ADD opcode.
      static int OP_COPY
      Extended COPY opcode.
      static int OP_DUAL_COPY
      COPY select bit in dual opcode.
      static int OP_EXT
      Extended command.
      static int OP_RUN
      RUN opcode.
      static int OP_SINGLE
      Single-operation opcode selector bit.
      static int OP_SINGLE_LIMIT
      The limit of immediate length single operations.
      static int OP_SINGLE_SPLIT
      The default split-point between copy and add for immediate length single operations.
    • Method Summary

      Static Methods 
      Modifier and Type Method Description
      static byte[] decode​(byte[] source, byte[] patch)
      Apply a DEZ-1 patch to data.
      static byte[] encode​(byte[] source, byte[] target)
      Creates a DEZ-1 format patch from data.
      static byte[] encode​(int blockSize, int stepSize, byte[] source, byte[] target)
      Creates a DEZ-1 format patch from data.
    • Field Detail

      • MAGIC

        static final byte[] MAGIC
        File magic 'DEZ1'.
      • OP_DUAL_COPY

        static final int OP_DUAL_COPY
        COPY select bit in dual opcode.
        See Also:
        Constant Field Values
      • OP_SINGLE

        static final int OP_SINGLE
        Single-operation opcode selector bit.
        See Also:
        Constant Field Values
      • OP_SINGLE_SPLIT

        static final int OP_SINGLE_SPLIT
        The default split-point between copy and add for immediate length single operations.

        A code below this value is a COPY, otherwise it is an ADD.

        See Also:
        Constant Field Values
      • OP_SINGLE_LIMIT

        static final int OP_SINGLE_LIMIT
        The limit of immediate length single operations.

        This is the first non-immediate opcode (i.e. op_copy)

        See Also:
        Constant Field Values
      • ADDR_ABSOLUTE

        static final int ADDR_ABSOLUTE
        Indicates whether this is an absolute or indirect address.

        If this bit it set this represents the first byte of a multi-byte encoded integer whose value is the absolute address reference.

        If this bit is clear then the address is an indirect address.

        See Also:
        Constant Field Values
      • ADDR_RELATIVE

        static final int ADDR_RELATIVE
        Indicates whether the address is relative or exact indirect.

        If set then the lower 5 bits select the source address slot and the ADDR_SIGN together with the following integer indicate the offset.

        If not set this is an exact indirect address and the lower 6 bits select the source address slot.

        See Also:
        Constant Field Values
      • ADDR_SIGN

        static final int ADDR_SIGN
        The sign of the offset for a relative indirect address.

        If set the offset should be negated.

        See Also:
        Constant Field Values
    • Method Detail

      • encode

        static byte[] encode​(byte[] source,
                             byte[] target)
        Creates a DEZ-1 format patch from data.

        This uses typical encoding parameters suitable for small textual blobs.

        Parameters:
        source - Source bytes.
        target - Target bytes.
        Returns:
        patch, the binary delta.
      • encode

        static byte[] encode​(int blockSize,
                             int stepSize,
                             byte[] source,
                             byte[] target)
        Creates a DEZ-1 format patch from data.

        This form allows tuning parameters.

        blockSize
        Recommended 6, typically 4-8. A smaller value will allow matching shorter runs but will increase hash collision rates and increase encoding times.
        stepSize
        Recommended 1 for text, typically 1-8. This has the largest impact on encoding time as it directly relates to the number of run-length tests. Smaller values will always produce more compact patches.
        Parameters:
        blockSize - The size of blocks considered for matching. Will be clamped to be ≥ 3.
        stepSize - The sampling step size. Will be clamped to be ≥ 1.
        source - Source bytes.
        target - Target bytes.
        Returns:
        patch the binary delta.
      • decode

        static byte[] decode​(byte[] source,
                             byte[] patch)
                      throws DEZFormatException
        Apply a DEZ-1 patch to data.
        Parameters:
        source - Source bytes.
        patch - Patch bytes.
        Returns:
        target, the reconstructed data.
        Throws:
        DEZFormatException - If the patch is not a compatible format, or the crc check failed.