[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: parallelism in a single instance of scrypt

To: scrypt@tarsnap.com
Subject: Re: parallelism in a single instance of scrypt
From: Solar Designer <solar@openwall.com>
Date: Sat, 24 Apr 2010 02:24:37 +0400
In-reply-to: <20100413192347.5ca44c61@gmail.com>
Mailing-list: contact scrypt-help@tarsnap.com; run by ezmlm
References: <20100408133830.GA12952@openwall.com> <20100413192347.5ca44c61@gmail.com>

Robert, Colin -

Thank you for your helpful replies.

On Tue, Apr 13, 2010 at 07:23:47PM -0700, Robert Ransom wrote:
> ... I just looked through the AVX instructions listed in
> section B.1.26 of the NASM 2.07 manual, and I didn't see any integer
> addition operations on 256-bit vectors.

I've just checked an Intel manual (AVX_319433-006.pdf), and surprisingly
you're right.  It says:

"Vector Integer" instructions are not promoted to 256-bit.

Yet there exist 256-bit bitwise operations - the "floating-point"
flavors of them - much like we had on the original SSE (before SSE2
introduced integer flavors of them).

Overall, I think it should be possible to use 256-bit bitwise ops, but
pairs of 128-bit integer addition instructions (plus some overhead to
do "128-bit" adds on the upper 128 bit halves of 256-bit registers).
Perhaps this will provide some speedup over a pure 128-bit implementation,
but less than a 2x speedup.

On the other hand, if scrypt was mostly built around a cryptographic
primitive requiring bitwise ops only, then a full 2x speedup would be
achievable.  The same would apply if scrypt was built around a bitslice
implementation of any cryptographic primitive, even one requiring adds.
Here's an example using MD5:

http://www.openwall.com/lists/john-users/2007/11/12/3
http://www.openwall.com/lists/john-users/2007/11/12/5

This bitslicing approach is useless when our architecture supports
vector adds with 32-bit elements, but it potentially becomes useful when
those are not fully supported.  Yet I expect that a traditional approach
would work better on AVX.  Vector adds with 32-bit elements would have
to be unsupported to a larger degree for the switch to bitslice to be
beneficial.

Yet it'd make more sense to build upon a more bitslice-friendly
primitive, and actually assume/use a bitslice implementation (thereby
avoiding any conversion overhead).  I've been toying with the idea of
building a "future-adaptable" crypt(3)-like function around bitslice DES
in late 1990s; I even had some pseudo-code for it (quite simple).  One
of its advantages would be the ease of expanding it to any vector size.

In fact, to some extent such expansion could be performed without even
changing a parameter value.  For example, with "virtual" vector size set
to 1024 bits, an implementation could be reasonably efficient on
shorter "physical" vectors as well - operating on their portions
sequentially.  Of course, this would not fit in registers, but a single
instance of bitslice DES (on the architecture's native vector size)
usually does not fit in registers anyway, yet it is CPU-bound since most
processing time is spent inside S-boxes, which usually manage to fit
their inputs and temporaries in registers.  The "sequential operation"
for emulating a larger virtual vector size would not have to be
performed at individual bitwise op level (which would make the task
cache- or memory-bound).  Instead, it could be performed at full-cipher
level, where each individual computation of the cipher (for the native
vector size) would have most of its data accesses fit in registers and
the cache.  In fact, I've tried this approach for an entirely different
purpose - JtR uses 8x32-bit vectors when building its bitslice DES
implementation for PA-RISC, because this proved to be faster on that
architecture (very fast cache).  On other architectures, similarly
increasing the "virtual" vector size beyond the physical one does not
change JtR's bitslice DES performance dramatically, which proves that
the approach is viable.

Now DES is probably something of the past (some "customers" would be
unhappy about it, even if its specific use is perfectly fine), but
bitslice AES is also possible (although it's probably harder to achieve
near-optimal S-box expressions for AES because of the larger number of
inputs - from 6 for DES up to 8 for AES, IIRC).

Returning to AVX specifics, it includes AES instructions (one round of
AES per instruction).  Obviously, this is incompatible with basing a KDF
around bitslice AES, yet it may be a reason to use non-bitslice AES.

Alexander

References:
- parallelism in a single instance of scrypt
  - From: Solar Designer <solar@openwall.com>
- Re: parallelism in a single instance of scrypt
  - From: Robert Ransom <rransom.8774@gmail.com>

Prev by Date: Re: parallelism in a single instance of scrypt
Next by Date: Re: parallelism in a single instance of scrypt
Previous by thread: Re: parallelism in a single instance of scrypt
Next by thread: Re: parallelism in a single instance of scrypt
Index(es):
- Date
- Thread