README CUDA:

CUDA support is tested under Linux and OSX. Some recent Linux distributions
have proprietry OpenCL and CUDA drivers included, for others (or for too new
cards) you will have to install nvidia's drivers. OSX will support OpenCL
natively but for using CUDA you need to download drivers from nvidia web site.

If your device supports CUDA it usually also supports OpenCL. Exceptions are
a few very old cards. See README-OPENCL for more OpenCL information.


Compiling:
  The autoconf system (./configure) should find your CUDA installation and add
  CUDA support to your build automatically. If it does not, you might have to
  pass some parameters about where it's located, eg.
      ./configure LDFLAGS=-L/opt/cuda/lib CFLAGS=-I/opt/cuda/include

  To build without CUDA, use:
      ./configure --disable-cuda".

  As a last resort, you can use the legacy Makefile:
      make -f Makefile.legacy linux-x86-64-gpu


If you have problems with JtR CUDA support:
  Please check that your PATH contains cuda bin (eg. /usr/local/cuda/bin)
  and that your LD_LIBRARY_PATH (or equivalent) contains cuda lib path
  (eg. /usr/local/cuda/lib and/or /usr/local/cuda/lib64).


Performance issues:
  If you have got Fermi or newer card change "-arch sm_10" to "-arch sm_20" in
  the NVCC_FLAGS (Makefile). If you have Kepler or newer, you can experiment
  with using "-arch sm_30" as well but some formats actually perform better
  with sm_20.

  To get better performance you can experiment with THREADS and BLOCKS macros
  defined for each format in cuda*.h file. Default THREADS and BLOCKS settings
  are not likely optimal. For some (weak) cards, THREADS may even be too high
  and this will show up as eg. "too many resources requested for launch in
  cryptmd5.cu at line 241". If this happens, try halving THREADS and doubling
  BLOCKS in cuda_cryptmd5.h, and then re-build. For more powerful cards, you
  can try just doubling THREADS until it gets too high (error) or suboptimal
  (slower), then settle for previous figure.

  For MSCash2[1]:
    CARD NAME   BLOCKS  THREADS SM      RESULT
    GTX460      14      128     20      14194 c/s
  For WPAPSK[1]:
    CARD NAME   BLOCKS  THREADS SM      RESULT
    GTX460      14      256     20      15186 c/s
  For PWSAFE[1]:
    CARD NAME   BLOCKS  THREADS SM      RESULT
    GTX460      112     512     10      50746 c/s
  For XSHA512[2]:
    CARD NAME   BLOCKS  THREADS SM      RESULT
    GTX570      1600    256     ??      67385K c/s
  For RAWSHA256[1]:
    CARD NAME   BLOCKS  THREADS SM      RESULT
    GTX570      7680    128     10      27561K c/s


OpenCL vs. CUDA parlor:
  Compared to OpenCL, THREADS equals "local worksize" while OpenCL's "global
  worksize" is what you get from THREADS*BLOCKS (note the difference).

  Example: The md5crypt-cuda format ships with THREADS=256 and BLOCKS=84 (28*3)
  hard-coded in cuda_cryptmd5.h. It performs poorly on powerful devices with
  these figures, while the OpenCL format auto-tunes and performs 23% better:

      Benchmarking: md5crypt [CUDA]... DONE
      Raw:    728361 c/s real, 728361 c/s virtual

      Local worksize (LWS) 128, Global worksize (GWS) 2097152
      Benchmarking: md5crypt [OpenCL]... DONE
      Raw:    892405 c/s real, 890510 c/s virtual

  Since the md5crypt-opencl auto-tuned to LWS=128 and GWS=2097152, we can
  edit cuda_cryptmd5.h and use THREADS=128 and BLOCKS=16384 (ie. 2097152/128).
  Now, THREADS equals LWS and THREADS*BLOCKS equals GWS, so we should have
  equal sizes for comparison. Now re-build John and perform a new benchmark:

      Benchmarking: md5crypt [CUDA]... DONE
      Raw:    907858 c/s real, 909827 c/s virtual

  That's even better than the OpenCL figure at the time of writing.


Supported formats:
  md5crypt-cuda     md5crypt
  mscash-cuda       M$ Cache Hash MD4
  mscash2-cuda      M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1
  phpass-cuda       phpass MD5
  pwsafe-cuda       Password Safe SHA-256
  raw-sha224-cuda   Raw SHA-224
  raw-sha256-cuda   Raw SHA-256
  raw-sha512-cuda   Raw SHA-512
  sha256crypt-cuda  sha256crypt
  sha512crypt-cuda  sha512crypt
  wpapsk-cuda       WPA-PSK PBKDF2-HMAC-SHA-1
  xsha512-cuda      Mac OS X 10.7+ salted SHA-512


Watchdog Timer:
  If your GPU is also your active display device, the display driver enables a
  watchdog timer by default, killing any kernel that runs for more than about
  five seconds. You will normally not get a proper error message, just some
  kind of failure after five seconds or more. Our goal is to split such kernels
  into subkernels with shorter durations but in the meantime (and especially if
  running slow kernels on weak devices) you might need to disable this
  watchdog. You can check this setting using "--list=cuda-devices":

    CUDA Device #0
        Name:                          GeForce GT 650M
        Compute capability:            sm_30
        Number of stream processors:   384 (2 x 192)
        Clock rate:                    878 Mhz
        Total global memory:           1.0 GB
        Total shared memory per block: 48.0 KB
        Total constant memory:         64.0 KB
        Kernel execution timeout:      No        <-- disabled watchdog
        Concurrent copy and execution: Yes
        Concurrent kernels support:    Yes
        Warp size:                     32

  We are currently not aware of any way to disable this watchdog under OSX.
  Under Linux, you can disable it by adding the 'Option "Interactive"' line
  to /etc/X11/xorg.conf:

    Section "Device"
        Identifier     "Device0"
        Driver         "nvidia"
        VendorName     "NVIDIA Corporation"
        Option         "Interactive"        "False"
    EndSection



You can contact us at
[1] lukas[dot]odzioba[at]gmail[dot]com
[2] qqlddg[at]gmail[dot]com
or john-dev mailing list
or irc #openwall@freenode
