Une version Française de ce post est disponible.

Ferris the crab, the unofficial Rust mascot

I'm working on a Rust project where I need to check a CRC32 checksum of a file. It is somewhat easy to do so using the CRC crate. This crate provides multiple methods to compute a CRC32.

CRC, Cyclic Redundancy Code, computing enable one receiver to check that transmitted data is error free. To do so, the data sender compute a checksum that depends on the sent data and transmit it. The receiver also compute the checksum, this time from the received data and match it against the sender's one. If they are equal, data was transmitted without error.

My goal was to reconstruct a file split in multiple encoded chunks. yEnc anyone ?

Each chunk has it's own CRC32 checksum embedded as an hexadecimal string, that I can match against once I have decoded it's data.

Something like that :

fn check_crc32(input: &String, crc32: u32) -> bool {
    match u32::from_str_radix(input, 16) {
        Ok(parsed_crc32) => parsed_crc32 == crc32,
        Err(_) => false
    }
}

...

check_crc32("aca76043", my_computed_crc)

Once I have decoded all the file's chunks, I recreate the file by writing all of the chunks at their destination offset in my output file.

One of the chunks may have the overall CRC32 checksum of the file. If I have one, I would like to verify it against my reconstructed file's one.

First try

When decoding chunks, I first thought about sharing a crc32::Digest and update it while writing decoded data to my Vec<u8> destination buffer. Once all chunks have been decoded, I would ask the digest for the overall CRC32 sum.

After a bit of struggling with the borrow checker, I succeeded in decoding a 175 parts, 100MB file. Noice !

However, when I checked my computed CRC32, it :

  • was not the same as the one provided in the file chunks and definitely
  • was not the same one as computed by the crc32 unix utility.

But my data written to disk matched the provided CRC32. So, I was obviously doing something wrong.

Not so fast, you fool

To speed up the decoding, I was reading one chunk at a time, decoding it to memory and updating the digest on the fly. Once in a while, I would flush to disk the decoded data, keeping in memory only metadata about chunks already read.

I noticed quite fast that reading data chunks in a random order was not such a good idea...

Indeed, CRC32 checksum is computed as data is written to the Digest. So updating it by throwing decoded data chunks in random order wasn't the right way to do it.

I could have reopened the file, reading every bytes out of it and recomputed the checksum. But I thought that if I was reconstructing a 4GB data file I wouldn't want to spend my time recomputing the checksum by reading again all of the data that I just had decoded.

Furthermore, if I wanted to multithread fearlessly the decoding or checksuming I needed to find another path.

And... what if I could compute the CRC32 of the whole file by combining all of the given chunks' checksums ? THAT would be nice.

Standing on the shoulders of giants

Obviously, this is an issue that others already tackled. And way better than you would ever conceive with your single mind.

In the famous zlib library Mark Adler wrote this not so well documented function : https://github.com/madler/zlib/blob/master/crc32.c#L372 .

I won't go into the nitty gritty details of how it works, he already explained it long and wide here.

What I know is that I needed it. And there was no such function in the CRC crate.

A quick web search about it reflected that most of the time people prefer to depend on the libz-sys crate. But I didn't want to depend on many numerous crates on my tiny side project. Rather than doing unsafe rust by calling the C library, I decided to take the bull by the horns and going pure rusty !

The solution

If you are looking for a pure rust crc32_combine function, here you go. You'll guess that it is massively inspired by the C function of the Zlib.

Combines two CRC checksums into one. For two data sets, Data1 and Data2 with sizes of len1 and len2, with CRC checksums crc1 and crc2. This function returns the CRC checksum of [Data1,Data2], requiring only crc1, crc2, and len2.

/* zlib.h -- interface of the 'zlib' general purpose compression library
  version 1.2.11, January 15th, 2017

  Copyright (C) 1995-2017 Jean-loup Gailly and Mark Adler

  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Jean-loup Gailly        Mark Adler
  jloup@gzip.org          madler@alumni.caltech.edu

*/
const GF2_DIM: usize = 32; // dimension of GF(2) vectors (length of CRC)

fn gf2_matrix_times(mat: &[u32], vec: u32) -> u32 {
    let mut sum = 0;
    let mut idx = 0;
    let mut vec = vec;
    while vec != 0 {
        if (vec & 1) != 0 {
            sum ^= mat[idx];
        }
        vec >>= 1;
        idx += 1;
    }
    sum
}

fn gf2_matrix_square(square: &mut [u32; GF2_DIM], mat: &[u32; GF2_DIM]) {
    for n in 0..GF2_DIM {
        let d = mat[n];
        square[n] = gf2_matrix_times(mat, d);
    }
}

fn crc32_combine(crc1: u32, crc2: u32, len2: u32) -> u32 {
    let mut even: [u32; GF2_DIM] = [0; GF2_DIM]; // even-power-of-two zeros operator
    let mut odd: [u32; GF2_DIM] = [0; GF2_DIM]; // odd-power-of-two zeros operator

    let mut crc1 = crc1;
    let mut len2 = len2;

    // degenerate case (also disallow negative lengths
    if len2 <= 0 {
        return crc1;
    }

    // put operator for one zero bit in odd
    odd[0] = 0xedb88320; // CRC-32 polynomial
    let mut row = 1;
    for i in 1..GF2_DIM {
        odd[i] = row;
        row <<= 1;
    }

    // put operator for two zero bits in even
    gf2_matrix_square(&mut even, &odd);

    // put operator for four zero bits in odd
    gf2_matrix_square(&mut odd, &even);

    // apply len2 zeros to crc1 (first square will put the operator for one
    // zero byte, eight zero bits, in even)
    loop {
        // apply zeros operator for this bit of len2
        gf2_matrix_square(&mut even, &odd);

        if (len2 & 1) != 0 {
            crc1 = gf2_matrix_times(&even, crc1);
        }
        len2 >>= 1;

        // if no more bits set; then done
        if len2 == 0 {
            break;
        }

        // another iteration of the loop with odd and even swapped
        gf2_matrix_square(&mut odd, &mut even);
        if (len2 & 1) != 0 {
            crc1 = gf2_matrix_times(&odd, crc1);
        }
        len2 >>= 1;

        if len2 == 0 {
            break;
        }
    }

    // return combined crc
    crc1 ^= crc2;
    return crc1;
}

Usage example :

let mut computed_crc32 = 0;
for part in &self.parts {
    let part_crc32 = match u32::from_str_radix(
        &part.infos.trailer.part_crc32.as_ref().unwrap().clone(),
        16,
    ) {
        Ok(parsed_crc32) => parsed_crc32,
        Err(error) => return Err(decoding_error("gnuh".to_string())),
    };
    println!("parsed pcrc32 {:x}", part_crc32);
    computed_crc32 = crc32_combine(computed_crc32, part_crc32, part.infos.trailer.file_size);
}
println!("combined crc32 : {:x}", computed_crc32);