Accelerating AI performance on 3rd Gen Intel® Xeon® Scalable processors with TensorFlow and Bfloat16


Source: blog.tensorflow.org

A guest post by Niranjan Hasabnis, Mohammad Ashraf Bhuiyan, Wei Wang, and AG Ramesh at Intel

The recent growth of deep learning has driven the development of more complex models that require significantly more compute and memory capacity. Several low-precision numeric formats have been proposed to address this problem. Google's bfloat16 and the IEEE FP16 half-precision format are two of the most widely used sixteen-bit formats. Mixed-precision training and inference using these low-precision formats reduce both compute and memory-bandwidth requirements.

Bfloat16, originally developed by Google and used in TPUs, uses one bit for sign, eight for exponent, and seven for mantissa. Due to the greater dynamic range of bfloat16 compared to FP16, bfloat16 can be used to represent gradients directly without the need for loss scaling. In addition, it has been shown that mixed precision training using bfloat16 can achieve the same state-of-the-art (SOTA) results across several models using the same number of iterations as FP32 and with no changes to hyper-parameters.
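A quick way to see that dynamic-range difference is to cast the same float32 values to both sixteen-bit formats. The following is a minimal sketch using standard TensorFlow casts; the specific values are illustrative only:

```python
import tensorflow as tf

# bfloat16 keeps float32's 8-bit exponent, so very large and very small
# magnitudes survive the cast (at roughly 3 significant decimal digits).
# float16 has only a 5-bit exponent and overflows to inf above ~65504.
x = tf.constant([1e-3, 1.0, 65504.0, 1e38], dtype=tf.float32)

print(tf.cast(x, tf.bfloat16))  # all four values remain finite
print(tf.cast(x, tf.float16))   # 1e38 overflows to inf
```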

The recently launched 3rd Gen Intel® Xeon® Scalable processor (codenamed Cooper Lake), featuring Intel® Deep Learning Boost, is the first general-purpose x86 CPU to support the bfloat16 format. Specifically, three new bfloat16 instructions are added as a part of the AVX512_BF16 extension within Intel Deep Learning Boost: VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. The first two instructions convert to and from the bfloat16 data type, while the last one performs a dot product of bfloat16 pairs. Further details can be found in the hardware numerics document published by Intel.
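On Linux, one way to check whether a machine exposes this extension is to look for the avx512_bf16 flag that the kernel reports in /proc/cpuinfo. This is a small illustrative helper, not part of TensorFlow or Intel's tooling:

```python
# Linux-only sketch: detect the AVX512_BF16 extension via the CPU flags
# the kernel lists in /proc/cpuinfo (the flag is named "avx512_bf16").
def has_avx512_bf16(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    with open(cpuinfo_path) as f:
        return any("avx512_bf16" in line
                   for line in f if line.startswith("flags"))

if __name__ == "__main__":
    print("AVX512_BF16 supported:", has_avx512_bf16())
```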

Intel has worked with the TensorFlow development team to enhance TensorFlow with bfloat16 support for CPUs. We are happy to announce that these features are now available in the Intel-optimized build of TensorFlow on GitHub. Developers can use the latest Intel build of TensorFlow to execute their current FP32 models using bfloat16 on 3rd Gen Intel Xeon Scalable processors with just a few code changes.
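If you are unsure whether the TensorFlow you have installed is the Intel-optimized (oneDNN/MKL-enabled) build, one quick check in builds of that era used an internal helper; this is a sketch, and the module path is an internal API that may differ across versions:

```python
import tensorflow as tf
# Internal helper; not a stable public API, so this may move between releases.
from tensorflow.python.framework import test_util

print("TensorFlow:", tf.__version__)
# IsMklEnabled() reports whether this binary was compiled with oneDNN/MKL support.
print("oneDNN/MKL build:", test_util.IsMklEnabled())
```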

Using bfloat16 with Intel-optimized TensorFlow

Existing TensorFlow 1 FP32 models (or TensorFlow 2 models using v1 compat mode) can be easily ported to the bfloat16 data type on Intel-optimized TensorFlow by enabling a graph rewrite pass (AutoMixedPrecisionMkl). The rewrite pass automatically converts certain operations to bfloat16 while keeping others in FP32 for numerical stability. Alternatively, models can be converted manually by following the instructions Google provides for running on TPUs; however, manual porting requires a good understanding of the model and can prove cumbersome and error-prone.
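For a TF1-style session, enabling the pass amounts to turning on the corresponding field in the grappler RewriterConfig. A minimal sketch, assuming an Intel-optimized build that includes AutoMixedPrecisionMkl (the proto field is named auto_mixed_precision_mkl):

```python
import tensorflow.compat.v1 as tf
from tensorflow.core.protobuf import rewriter_config_pb2

tf.disable_eager_execution()

# Enable the AutoMixedPrecisionMkl grappler pass: eligible ops are rewritten
# to bfloat16, while numerically sensitive ops stay in FP32.
rewrite_options = rewriter_config_pb2.RewriterConfig(
    auto_mixed_precision_mkl=rewriter_config_pb2.RewriterConfig.ON)
config = tf.ConfigProto(
    graph_options=tf.GraphOptions(rewrite_options=rewrite_options))

with tf.Session(config=config) as sess:
    ...  # build and run the existing FP32 graph; no model changes needed
```

In TF2 builds that ship the pass, the same option can reportedly be toggled with tf.config.optimizer.set_experimental_options({'auto_mixed_precision_mkl': True}), though availability depends on the build.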

TensorFlow 2 has a Keras mixed precision API that allows model developers to use mixed precision for training Keras models on GPUs and TPUs. We are currently working on supporting this API in Intel-optimized TensorFlow for 3rd Gen Intel Xeon Scalable processors. This feature will be available in the TensorFlow master branch later this year. Once available, we recommend using the Keras API over the grappler pass, as the Keras API is more flexible and supports Eager mode.
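Once that support lands, usage should look roughly like the existing Keras mixed-precision workflow, with the 'mixed_bfloat16' policy in place of the GPU-oriented 'mixed_float16'. A minimal sketch against the API as it stabilized in TF 2.4 (earlier releases expose the same functionality under tf.keras.mixed_precision.experimental):

```python
import tensorflow as tf

# Compute in bfloat16 while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
    # Keep the final activation in float32 for numerical stability.
    tf.keras.layers.Activation("softmax", dtype="float32"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```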

Performance improvements

We investigated the performance improvement of mixed-precision training and inference with bfloat16 on three models: ResNet50v1.5, BERT-Large (SQuAD), and SSD-ResNet34. ResNet50v1.5 is a widely tested image classification model that has been included in MLPerf for benchmarking different hardware on vision workloads. BERT-Large (SQuAD) is a fine-tuning task that focuses on reading comprehension and aims to answer questions given a text/document. SSD-ResNet34 is an object detection model that uses ResNet34 as a backbone.

The bfloat16 models were benchmarked on a 4-socket system with 28-core 3rd Gen Intel Xeon Scalable processors* and compared with FP32 performance on a 4-socket system with 28-core 2nd Gen Intel Xeon Scalable processors.


As shown in the charts above, training the models with bfloat16 mixed precision on 3rd Gen Intel Xeon Scalable processors was 1.7x to 1.9x faster than FP32 training on 2nd Gen Intel Xeon Scalable processors. Similarly, for inference, bfloat16 precision delivered a 1.87x to 1.9x performance increase.

Accuracy and time to train

In addition to performance measurements, we ran full convergence tests for the three deep learning models on multi-socket systems based on 3rd Gen Intel Xeon Scalable processors*. For BERT-Large (SQuAD) and SSD-ResNet34, 4-socket, 28-core systems were used; for ResNet50v1.5, an 8-socket, 28-core system. Each model was first trained with FP32, and then trained with mixed precision using exactly the same hyper-parameters (learning rate, etc.) and batch sizes.
The results show that models from three different use cases (image classification, language modeling, and object detection) all reach SOTA accuracy in the same number of epochs. For ResNet50v1.5, the standard MLPerf threshold of 75.9% top-1 accuracy was used, and both bfloat16 and FP32 reached the target at epoch 84 (evaluating every 4 epochs with an eval offset of 0). For the BERT-Large (SQuAD) fine-tuning task, both bfloat16 and FP32 converged in two epochs, and SSD-ResNet34 trained in 60 epochs. Combined with the improved runtime performance, the total time to train with bfloat16 was 1.7x to 1.9x shorter than with FP32.

Intel-optimized Community build of TensorFlow

The Intel-optimized build of TensorFlow now supports Intel® Deep Learning Boost's new bfloat16 capability for mixed-precision training and low-precision inference in the TensorFlow GitHub master branch. More information on the Intel build is available here. The models mentioned in this blog, along with scripts to run them in bfloat16 and FP32 mode, are available through the Model Zoo for Intel Architecture (v1.6.1 or later), which you can download and try from here. [Note: To run a bfloat16 model, you will need an Intel Xeon Scalable processor (Skylake) or a later-generation Intel Xeon processor. However, to get the best performance from bfloat16 models, you will need a 3rd Gen Intel Xeon Scalable processor.]

Conclusion

As deep learning models get larger and more complicated, the combination of the latest 3rd Gen Intel Xeon Scalable processors and Intel Deep Learning Boost's new bfloat16 format can deliver a 1.7x to 1.9x performance increase over FP32 on 2nd Gen Intel® Xeon® Scalable processors, without any loss of accuracy. We have enhanced the Intel-optimized build of TensorFlow so developers can easily port their models to mixed-precision training and inference with bfloat16. In addition, we have shown that the automatically converted bfloat16 model does not need any additional hyper-parameter tuning to converge; you can use the same set of hyper-parameters that you used to train the FP32 models.

Acknowledgements

The results presented in this blog are the work of many people, including the Intel TensorFlow and oneDNN teams and our collaborators on Google's TensorFlow team.

From Intel - Jojimon Varghese, Xiaoming Cui, Md Faijul Amin, Niroop Ammbashankar, Mahmoud Abuzaina, Sharada Shiddibhavi, Chuanqi Wang, Yiqiang Li, Yang Sheng, Guizi Li, Teng Lu, Roma Dubstov, Tatyana Primak, Evarist Fomenko, Igor Safonov, Abhiram Krishnan, Shamima Najnin, Rajesh Poornachandran, Rajendrakumar Chinnaiyan.

From Google - Reed Wanderman-Milne, Penporn Koanantakool, Rasmus Larsen, Thiru Palaniswamy, Pankaj Kanwar.

*For configuration details see www.intel.com/3rd-gen-xeon-configs.

Notices and Disclaimers

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
...


