Hot Chips 2020 Live Blog: Cerebras WSE Programming (3:00pm PT)

06:03PM EDT – Cerebras built the Wafer Scale Engine – a single chip the size of a full wafer

06:03PM EDT – Here’s WSE1

06:04PM EDT – Uses standard tools like TensorFlow and PyTorch with the Cerebras compiler
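
As a point of reference, the input at this stage is just a standard framework model – a minimal sketch in plain PyTorch (no Cerebras-specific API shown; the model itself is a made-up example):

```python
import torch
import torch.nn as nn

# A plain PyTorch model -- this framework-level definition is the kind
# of input a graph compiler starts from (no vendor API used here).
class TinyMLP(nn.Module):
    def __init__(self, d_in=784, d_hidden=256, d_out=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

model = TinyMLP()
print(model(torch.randn(4, 784)).shape)  # torch.Size([4, 10])
```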

06:04PM EDT – CS-1 fits in 15U of a standard rack

06:04PM EDT – 400k cores

06:04PM EDT – (Costs a few $mil each)

06:05PM EDT – No off-chip DRAM – all memory is on-chip SRAM

06:05PM EDT – 2D mesh network across the wafer

06:06PM EDT – Allows all 400k cores to work on the same problem

06:06PM EDT – Linear perf scaling

06:07PM EDT – Cerebras graph compiler

06:07PM EDT – extract the compute graph, translate it into the WSE kernel graph format, place and route the kernels on the fabric, then create the executable
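
A minimal sketch of those four stages in plain Python – every function and data structure here is a hypothetical illustration of the flow, not the actual Cerebras compiler internals:

```python
from dataclasses import dataclass, field

# Hypothetical illustration of the four stages described above:
# extract graph -> translate to WSE kernels -> place & route -> executable.

@dataclass
class Graph:
    kernels: list = field(default_factory=list)    # op names
    placement: dict = field(default_factory=dict)  # kernel -> fabric region

def extract_graph(model_ops):
    """Stage 1: pull the compute graph out of the framework model."""
    return Graph(kernels=list(model_ops))

def lower_to_wse(g):
    """Stage 2: rewrite framework ops into WSE-format kernels."""
    g.kernels = [f"wse_{op}" for op in g.kernels]
    return g

def place_and_route(g):
    """Stage 3: give each kernel a region of the fabric."""
    for i, k in enumerate(g.kernels):
        g.placement[k] = (0, i)  # toy placement: one row, left to right
    return g

def emit_executable(g):
    """Stage 4: serialize the placed graph into a loadable artifact."""
    return repr(g.placement).encode()

binary = emit_executable(place_and_route(lower_to_wse(
    extract_graph(["matmul", "relu", "matmul"]))))
print(binary)
```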

06:09PM EDT – Graph matching for matrix multiply loops
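
As a toy illustration of this kind of graph matching (my own example, not the real matcher), here is a pass that recognizes a matmul feeding a bias add and fuses the pair into one kernel:

```python
# Toy compute graph: a list of (op, inputs) nodes in execution order.
graph = [
    ("matmul", ["x", "W"]),
    ("add",    ["matmul_out", "b"]),
    ("relu",   ["add_out"]),
]

def fuse_matmul_add(nodes):
    """Collapse a matmul immediately consumed by an add into one fused kernel."""
    out, i = [], 0
    while i < len(nodes):
        if (i + 1 < len(nodes)
                and nodes[i][0] == "matmul"
                and nodes[i + 1][0] == "add"):
            out.append(("fused_matmul_bias", nodes[i][1] + [nodes[i + 1][1][1]]))
            i += 2
        else:
            out.append(nodes[i])
            i += 1
    return out

print(fuse_matmul_add(graph))
# [('fused_matmul_bias', ['x', 'W', 'b']), ('relu', ['add_out'])]
```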

06:09PM EDT – Supports hand optimized kernels

06:10PM EDT – Model-parallel and data-parallel optimization with the compiler

06:11PM EDT – Trade-off between fabric resources and compute time for each kernel

06:12PM EDT – All kernels can be resized as needed

06:12PM EDT – All sizes of a kernel are functionally identical – only performance differs

06:13PM EDT – Global optimization function to maximize throughput and utilization
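
A back-of-envelope sketch of that global optimization under an assumed model (pipeline throughput limited by the slowest kernel, kernel time proportional to work/area – not Cerebras' actual algorithm): a greedy allocator keeps handing fabric cores to the current bottleneck:

```python
# Toy global optimizer: pipeline throughput is set by the slowest kernel,
# and a kernel's time scales ~ work/area, so repeatedly give a spare core
# to whichever kernel is currently the bottleneck. (Illustrative only.)
kernels = {"conv1": 8.0, "conv2": 4.0, "fc": 1.0}  # assumed relative work
total_cores = 400_000
area = {k: 1 for k in kernels}                     # start with 1 core each

for _ in range(total_cores - len(kernels)):
    bottleneck = max(kernels, key=lambda k: kernels[k] / area[k])
    area[bottleneck] += 1

step_time = max(kernels[k] / area[k] for k in kernels)
print({k: f"{100 * area[k] / total_cores:.0f}%" for k in kernels})
# Area ends up proportional to each kernel's work: ~62% / 31% / 8%.
```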

06:14PM EDT – 3 key benefits

06:15PM EDT – Flexible parallelism

06:16PM EDT – Enough fabric performance to connect everything at scale

06:16PM EDT – Otherwise communication is slow across GPUs or across a cluster

06:16PM EDT – Even small batch sizes achieve very high utilization

06:16PM EDT – no weight sync overhead
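
Rough arithmetic on why that matters, with entirely assumed numbers: in data-parallel training across GPUs every step pays a gradient all-reduce, while on a single wafer the weights stay in on-chip SRAM next to the compute:

```python
# Back-of-envelope comparison (all numbers assumed, for illustration only).
params = 350e6         # model parameters (assumed)
bytes_per_grad = 2     # fp16 gradients
link_bw = 50e9         # effective all-reduce bandwidth, bytes/s (assumed)
compute_time = 0.010   # seconds of math per step (assumed)

# Ring all-reduce moves roughly 2x the gradient volume per node.
sync_time = 2 * params * bytes_per_grad / link_bw
print(f"GPU cluster step:  {compute_time + sync_time:.4f}s "
      f"({100 * sync_time / (compute_time + sync_time):.0f}% sync)")
print(f"Single-wafer step: {compute_time:.4f}s (0% sync)")
```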

06:19PM EDT – Core is designed for sparsity

06:20PM EDT – Intrinsic sparsity harvesting

06:20PM EDT – filters out all zeros

06:23PM EDT – ML user has full control over full range of sparse techniques
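
A NumPy sketch of the zero-filtering idea (illustrative only – the WSE does this in hardware at the dataflow level): only the nonzero activations are sent to the multiply, and the result is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
acts = np.maximum(rng.standard_normal(10_000), 0)  # ReLU output: ~50% zeros
W = rng.standard_normal((10_000, 64))

# Dense: multiply everything, zeros included.
dense = acts @ W

# "Sparsity harvesting": send only the nonzero activations to the compute.
nz = np.flatnonzero(acts)
sparse = acts[nz] @ W[nz]

assert np.allclose(dense, sparse)
print(f"nonzero fraction: {len(nz) / acts.size:.2f} -> "
      f"~{100 * (1 - len(nz) / acts.size):.0f}% of multiplies skipped")
```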

06:23PM EDT – WSE is MIMD, each core can be independent

06:24PM EDT – True variable sequence length support

06:24PM EDT – No padding required

06:24PM EDT – Higher utilization for irregular models

06:25PM EDT – Dynamic depth networks

06:26PM EDT – only need to process exact lengths of sequences
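
Illustrative arithmetic with an assumed batch of sequence lengths, showing how much of the padded compute is wasted work that exact-length processing avoids:

```python
# With padding, every sequence in a batch is processed at the length of
# the longest one; with true variable-length support you only do the
# work for each sequence's actual tokens. (Assumed example batch.)
seq_lens = [12, 37, 64, 9, 128, 33, 71, 20]

padded_work = len(seq_lens) * max(seq_lens)  # everything padded to 128
exact_work = sum(seq_lens)                   # process exact lengths only

print(f"utilization with padding: {100 * exact_work / padded_work:.0f}%")
# -> useful work is well under half of the padded compute for this batch
```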

06:26PM EDT – World’s most powerful AI computer

06:27PM EDT – Complete flexibility due to the size of the wafer scale engine

06:28PM EDT – Working in the lab today

06:29PM EDT – More info later this year

06:29PM EDT – Q&A time

06:31PM EDT – Q: What’s the main benefit of WSE? A: Bypassing the issues with workloads that need multiple GPUs/TPUs/DPUs. It opens up novel techniques that wouldn’t run on traditional hardware at any reasonable speed

06:32PM EDT – Q: How do you feed the beast? A: In a traditional server the bottleneck might be the GPU/TPU, so the next bottleneck is IO. We are a system company – our product is the full system, and we control all aspects of it. We have 1.2 Tb/s of Ethernet interconnect to feed the engine and keep up with the compute

06:34PM EDT – Q: How long does it take to compile a model across 400k cores? A: It’s an algorithmically complex search space problem. Annealing and heuristics bring that down – we are borrowing many ideas from the EDA industry. Our problem is simpler than billions of LEs on FPGAs, so compile times are in the minutes.

06:34PM EDT – That’s a wrap. The next talk is on a 4096-core RISC-V chip