[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

parallelism in a single instance of scrypt



Hi Colin and all,

Colin - impressive work on scrypt.  Thanks!

I've skimmed over the two PDFs (paper and slides) and the source code
for scrypt-1.1.6.  Frankly, I have not fully "read" any of this yet, but
I thought I'd ask the following question now anyway, because I think it
should be answered in the paper and/or slides more prominently - such
that one would find the answer right away.  So by being lazy I am trying
to help improve the materials available on scrypt. ;-)

How much parallelism is there in a single instance of the scrypt key
derivation function?  I notice that you have an implementation for SSE2,
which means 4x parallelism (right?), but can it be extended further -
say, to use multiple CPU cores for a single scrypt computation?  What is
going to happen when x86-64 CPUs with AVX (256-bit vectors) hit the
market - will it be possible to make optimal use of them?  And chances
are that future CPUs and GPUs (or hybrids) will have even more cores and
longer vectors.  If this silicon can't be used optimally for "defensive"
uses of scrypt (enabling more extensive use of key stretching), then
that's a major shortcoming of scrypt.  (bcrypt and all others are
similarly problematic in this respect, which I complained about 10+
years ago, and I was hoping that the replacement would address this - in
fact, I did some work on this area at the time, but never finished that.)

How much pressure does scrypt place on the memory bus?  Is it CPU-bound
or I/O-bound?  If it's I/O-bound, then this might be good in terms of
required hardware cost (although this is questionable), yet it means
that it does not fully use the CPU, which is bad.  Sounds like a nasty
trade-off to me.  Ideally, a key derivation function should be able to
optimally use the hardware (all of: CPUs, GPUs, buses, memory) for a
_single_ instance of it on any common system that it is run on (of
course, the parameters may need to be tweaked by the system admin) - but
this goal is very hard to achieve.  Somehow I haven't seen the paper,
nor the slides, try to address this trade-off dilemma.  Have I missed it?

Thanks again,

Alexander