I implemented carry-less multiplciation using the CLMUL instruction set. This is similarly fast to simple modulo multiplication. But computating the result mod some polynomial is still very slow. I do it this way:
for (unsigned int i = 32; i-- > 0; )
{
if (c & (1L << (i + 32)))
{
c ^= 1L << (i + 32);
c ^= (uint64_t)p << i;
}
}
Where c is 64-bit carry-less product and p is some irreducible polynomial 32-bit. I’m not sure is that code correct. Can we do this faster?
If I’m right that we still have to do this rather expensive procedure after calculating the product, then I began to wonder if it would be reasonable to do carry-less multiplication alone just mod 2^k. Does carry-less multiplication mod 2^k itself also have similar advantages as carry-less multiplication mod polynomial?
It could be constant time, but is it uniformly distributed? In general is carry-less multiplication reasonable alternative to multiplication in GF(2^k)?