💾 trunk/9867bb37683bd898d547744e95f9916f8395f44c: Fix CPU GEMM k-slicing cache-block indexing (#183733)


News section: 💾 Downloads
🔗 Source: github.com

Correct the CPU GEMM k-slicing reduction path when an N thread block is split into multiple cache blocks so the local buffer slots, row stride, and store slices all use cache-block dimensions...
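To make the summary above concrete, here is a minimal toy sketch of a k-sliced GEMM in plain Python. It is not the PyTorch code from the commit: the function name `gemm_k_sliced`, its parameters, and the sequential loop standing in for threads are all assumptions made for illustration. It only shows the general idea that, when the N range is walked in cache blocks, each k-slice's local buffer, its row stride, and the slice of the output that gets stored are all sized by the cache-block width rather than by the full N width.

```python
def gemm_k_sliced(A, B, M, N, K, k_slices=2, n_cache_block=2):
    """Toy k-sliced GEMM on nested lists: C[M][N] = A[M][K] @ B[K][N].

    Hypothetical illustration only; k-slices run sequentially here
    instead of on separate threads.
    """
    # Split the K range into contiguous slices, one per simulated thread.
    k_step = (K + k_slices - 1) // k_slices
    C = [[0.0] * N for _ in range(M)]

    # Walk the N range one cache block at a time.
    for n0 in range(0, N, n_cache_block):
        nc = min(n_cache_block, N - n0)  # width of this cache block

        # One flat local buffer per k-slice, sized M x nc (cache-block dims).
        local = [[0.0] * (M * nc) for _ in range(k_slices)]

        for s in range(k_slices):
            k_lo = s * k_step
            k_hi = min((s + 1) * k_step, K)
            for m in range(M):
                for n in range(nc):
                    acc = 0.0
                    for k in range(k_lo, k_hi):
                        acc += A[m][k] * B[k][n0 + n]
                    # Row stride is nc (cache block), not N (full width).
                    local[s][m * nc + n] = acc

        # Reduce the partial sums and store only this cache block's columns.
        for m in range(M):
            for n in range(nc):
                C[m][n0 + n] = sum(local[s][m * nc + n] for s in range(k_slices))
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0], [1.0, 1.0, 1.0]]
    print(gemm_k_sliced(A, B, M=2, N=3, K=3, k_slices=2, n_cache_block=2))
```

In this sketch, indexing `local` with a stride of `N` instead of `nc` would read and write outside the `M * nc` buffer as soon as `N` is larger than one cache block, which is the general kind of mismatch the summary describes.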

🔧 DeepGEMM Essentials: High-Performance FP8 Matrix Multiplication


📈 430.94 points
🔧 Programming

🔧 Writing High-Performance Kernels in TileLang, from GEMM to MLA


📈 389.9 points
🔧 Programming

🔧 DeepSeek DeepGEMM 中文讲解


📈 287.29 points
🔧 Programming

🔧 Apple Silicon's AI Ceiling Is Higher Than You Think


📈 184.69 points
🔧 Programming

🔧 How to Read GPU Profiling Logs: A Ground-Up Guide


📈 164.17 points
🔧 Programming

🔧 Advanced GPU Optimization: Tensor Core Programming (NVIDIA)


📈 143.65 points
🔧 Programming

🔧 NVIDIA CUTLASS: High-Performance CUDA Templates for AI Linear Algebra


📈 102.61 points
🔧 Programming

🔧 Pyptx: Write Nvidia PTX Kernels in Python for Hopper and Blackwell


📈 102.61 points
🔧 Programming

🔧 If Memory Could Compute, Would We Still Need GPUs?


📈 82.08 points
🔧 Programming

🔧 CUDA Memory Hierarchy, Tile Programming, & DLSS 310.6 Driver Enhancements


📈 61.56 points
🔧 Programming

🔧 numr 0.5.0: The Rust numerical computing library that doesn't make you choose


📈 61.56 points
🔧 Programming

🔧 Proof-of-Work as a Hidden Subsidy


📈 61.56 points
🔧 Programming

🔧 AxonML -- A PyTorch-equivalent ML framework written in Rust


📈 41.04 points
🔧 Programming

🔧 The Ghost in the Batch: How vLLM Silently Switches Algorithms


📈 41.04 points
🔧 Programming

🔧 VHE: GPU-Accelerated Gate-Level Simulation at Zero License Cost


📈 41.04 points
🔧 Programming

🔧 HeMA-MISO: Heterogeneous Memory Architecture for LLM Inference with SW Optimization


📈 41.04 points
🔧 Programming

🔧 Kog hits 3K t/s on MI300X, no kernel switches — test it now


📈 41.04 points
🔧 Programming

💾 ciflow/torchtitan/186752: [Inductor] Host-side TMA descriptors for Blackwell mm templates


📈 41.04 points
💾 Downloads

🔧 Intel Xe3P Leaks 160GB LPDDR5X; FlashAttention-2 in CuTe & Custom CUDA GPT-2 Engine


📈 41.04 points
🔧 Programming

🔧 20260324_ai_bubble_8gb_en


📈 20.52 points
🔧 Programming

🔧 2.78 TFLOPS on a Fanless MacBook Air? Benchmarking Apple's M4 with MLX


📈 20.52 points
🔧 Programming

🔧 MiniMax M3 Explained: The Sparse Attention Breakthrough


📈 20.52 points
🔧 Programming

🔧 How I Added AI Image Search to a Marketplace Bot (And Why It Changed Everything)


📈 20.52 points
🔧 Programming

📰 Tencent Hunyuan Releases HPC-Ops: A High Performance LLM Inference Operator Library


📈 20.52 points
🔧 AI News

💾 viable/strict/1781213548: Add PyTorch QuACK GEMM epilogue adapter e.g FlexGemm (#186483)


📈 20.52 points
💾 Downloads

📰 Learning Triton One Kernel at a Time: Matrix Multiplication


📈 20.52 points
🔧 AI News

💾 trunk/319ee4ea19c03438d7ae3c585bc65b11fb9dd266: Add PyTorch QuACK GEMM epilogue adapter (#186310)


📈 20.52 points
💾 Downloads

🔧 ARM System-on-Chip (SoC) Deep Dive: Edge AI and Coherency Fabric


📈 20.52 points
🔧 Programming

🔧 Running PyTorch fork-safe in Celery on macOS


📈 20.52 points
🔧 Programming