Une version Française de ce post est disponible.
I'm working on a Rust project where I need to check a CRC32 checksum of a file. It is somewhat easy to do so using the CRC crate. This crate provides multiple methods to compute a CRC32.
CRC, Cyclic Redundancy Code, computing enable one receiver to check that transmitted data is error free. To do so, the data sender compute a checksum that depends on the sent data and transmit it. The receiver also compute the checksum, this time from the received data and match it against the sender's one. If they are equal, data was transmitted without error.
My goal was to reconstruct a file split in multiple encoded chunks. yEnc anyone ?
Each chunk has it's own CRC32 checksum embedded as an hexadecimal string, that I can match against once I have decoded it's data.
Something like that :
fn check_crc32(input: &String, crc32: u32) -> bool {
match u32::from_str_radix(input, 16) {
Ok(parsed_crc32) => parsed_crc32 == crc32,
Err(_) => false
}
}
...
check_crc32("aca76043", my_computed_crc)
Once I have decoded all the file's chunks, I recreate the file by writing all of the chunks at their destination offset in my output file.
One of the chunks may have the overall CRC32 checksum of the file. If I have one, I would like to verify it against my reconstructed file's one.
First try
When decoding chunks, I first thought about sharing a crc32::Digest
and update
it while writing decoded data to my Vec<u8>
destination buffer. Once all
chunks have been decoded, I would ask the digest for the overall CRC32 sum.
After a bit of struggling with the borrow checker, I succeeded in decoding a 175 parts, 100MB file. Noice !
However, when I checked my computed CRC32, it :
- was not the same as the one provided in the file chunks and definitely
- was not the same one as computed by the
crc32
unix utility.
But my data written to disk matched the provided CRC32. So, I was obviously doing something wrong.
Not so fast, you fool
To speed up the decoding, I was reading one chunk at a time, decoding it to memory and updating the digest on the fly. Once in a while, I would flush to disk the decoded data, keeping in memory only metadata about chunks already read.
I noticed quite fast that reading data chunks in a random order was not such a good idea...
Indeed, CRC32 checksum is computed as data is written to the Digest. So updating it by throwing decoded data chunks in random order wasn't the right way to do it.
I could have reopened the file, reading every bytes out of it and recomputed the checksum. But I thought that if I was reconstructing a 4GB data file I wouldn't want to spend my time recomputing the checksum by reading again all of the data that I just had decoded.
Furthermore, if I wanted to multithread fearlessly the decoding or checksuming I needed to find another path.
And... what if I could compute the CRC32 of the whole file by combining all of the given chunks' checksums ? THAT would be nice.
Standing on the shoulders of giants
Obviously, this is an issue that others already tackled. And way better than you would ever conceive with your single mind.
In the famous zlib library Mark Adler wrote this not so well documented function : https://github.com/madler/zlib/blob/master/crc32.c#L372 .
I won't go into the nitty gritty details of how it works, he already explained it long and wide here.
What I know is that I needed it. And there was no such function in the CRC crate.
A quick web search about it reflected that most of the time people prefer to depend on the libz-sys crate. But I didn't want to depend on many numerous crates on my tiny side project. Rather than doing unsafe rust by calling the C library, I decided to take the bull by the horns and going pure rusty !
The solution
If you are looking for a pure rust crc32_combine
function, here you go.
You'll guess that it is massively inspired by the C function of the Zlib.
Combines two CRC checksums into one. For two data sets, Data1 and Data2 with sizes of len1 and len2, with CRC checksums crc1 and crc2. This function returns the CRC checksum of [Data1,Data2], requiring only crc1, crc2, and len2.
/* zlib.h -- interface of the 'zlib' general purpose compression library
version 1.2.11, January 15th, 2017
Copyright (C) 1995-2017 Jean-loup Gailly and Mark Adler
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Jean-loup Gailly Mark Adler
jloup@gzip.org madler@alumni.caltech.edu
*/
const GF2_DIM: usize = 32; // dimension of GF(2) vectors (length of CRC)
fn gf2_matrix_times(mat: &[u32], vec: u32) -> u32 {
let mut sum = 0;
let mut idx = 0;
let mut vec = vec;
while vec != 0 {
if (vec & 1) != 0 {
sum ^= mat[idx];
}
vec >>= 1;
idx += 1;
}
sum
}
fn gf2_matrix_square(square: &mut [u32; GF2_DIM], mat: &[u32; GF2_DIM]) {
for n in 0..GF2_DIM {
let d = mat[n];
square[n] = gf2_matrix_times(mat, d);
}
}
fn crc32_combine(crc1: u32, crc2: u32, len2: u32) -> u32 {
let mut even: [u32; GF2_DIM] = [0; GF2_DIM]; // even-power-of-two zeros operator
let mut odd: [u32; GF2_DIM] = [0; GF2_DIM]; // odd-power-of-two zeros operator
let mut crc1 = crc1;
let mut len2 = len2;
// degenerate case (also disallow negative lengths
if len2 <= 0 {
return crc1;
}
// put operator for one zero bit in odd
odd[0] = 0xedb88320; // CRC-32 polynomial
let mut row = 1;
for i in 1..GF2_DIM {
odd[i] = row;
row <<= 1;
}
// put operator for two zero bits in even
gf2_matrix_square(&mut even, &odd);
// put operator for four zero bits in odd
gf2_matrix_square(&mut odd, &even);
// apply len2 zeros to crc1 (first square will put the operator for one
// zero byte, eight zero bits, in even)
loop {
// apply zeros operator for this bit of len2
gf2_matrix_square(&mut even, &odd);
if (len2 & 1) != 0 {
crc1 = gf2_matrix_times(&even, crc1);
}
len2 >>= 1;
// if no more bits set; then done
if len2 == 0 {
break;
}
// another iteration of the loop with odd and even swapped
gf2_matrix_square(&mut odd, &mut even);
if (len2 & 1) != 0 {
crc1 = gf2_matrix_times(&odd, crc1);
}
len2 >>= 1;
if len2 == 0 {
break;
}
}
// return combined crc
crc1 ^= crc2;
return crc1;
}
Usage example :
let mut computed_crc32 = 0;
for part in &self.parts {
let part_crc32 = match u32::from_str_radix(
&part.infos.trailer.part_crc32.as_ref().unwrap().clone(),
16,
) {
Ok(parsed_crc32) => parsed_crc32,
Err(error) => return Err(decoding_error("gnuh".to_string())),
};
println!("parsed pcrc32 {:x}", part_crc32);
computed_crc32 = crc32_combine(computed_crc32, part_crc32, part.infos.trailer.file_size);
}
println!("combined crc32 : {:x}", computed_crc32);