Introducing Facebook's software and hardware infrastructure for machine learning to meet its global scale computing needs

At the end of 2017, Facebook's Applied Machine Learning group published a paper describing the company's entire machine learning software and hardware architecture. Reading the full text also offers a glimpse of the machine learning strategies behind Facebook's products. The paper addresses the new challenges of doing machine learning at global scale and presents Facebook's strategies and solutions, which are highly relevant to related industries and research.

Summary

Machine learning has a pivotal position in many of Facebook's products and services. This article will detail how Facebook's hardware and software infrastructure in machine learning can meet its global scale computing needs.

Facebook's machine learning needs are extremely complex: it must run many different machine learning models, and this complexity runs through every layer of the Facebook system stack. In addition, a significant portion of all the data Facebook stores flows through its machine learning pipelines, and this data load puts tremendous pressure on Facebook's distributed, high-performance training infrastructure.

The computational demands are also intense: balancing the large CPU capacity needed for real-time inference against the GPU/CPU capacity needed for training creates significant tension. Resolving these and other challenges requires sustained effort spanning machine learning algorithms, software, and hardware design.

Introduction

Facebook's mission is to "give people the power to build community and bring the world closer together." As of December 2017, Facebook connected more than 2 billion people worldwide. Over the past few years, machine learning has likewise undergone a revolution in solving practical problems at this global scale, driven by a virtuous cycle of algorithmic innovation, massive amounts of data for model training, and advances in high-performance computer architecture.

At Facebook, machine learning plays a key role in almost every aspect of improving the user experience, including ranking posts for News Feed, speech and text translation, and photo and live video classification.

Facebook uses a variety of machine learning algorithms in these services, including support vector machines, gradient boosted decision trees, and many types of neural networks. This article introduces several important aspects of Facebook's data center architecture that support its machine learning needs, including internal "ML-as-a-Service" workflows, open source machine learning frameworks, and distributed training algorithms.

From a hardware perspective, Facebook uses a large fleet of CPU and GPU platforms to train models at the frequency and latency their services require. For machine learning inference, Facebook relies mainly on CPUs for all major services, with neural network ranking services (such as News Feed) accounting for the bulk of the compute load.

A large share of the vast amount of data Facebook stores flows through its machine learning pipelines, and that share keeps growing over time in order to improve model quality. Supplying the sheer volume of data that machine learning needs is a challenge Facebook's data centers face at global scale.

Techniques currently used to feed data to models efficiently include decoupling data feeding from training, co-locating data and compute, and network optimization. At the same time, Facebook's sheer compute and data scale presents a unique opportunity: during the daily load cycle, a large number of CPUs that could be used for distributed training sit idle during off-peak hours.

Facebook's computing fleet spans dozens of data centers, so its scale also provides a degree of disaster tolerance. Timely delivery of new machine learning models is important to the operation of Facebook's business, and disaster recovery planning is critical to ensuring it.

Looking ahead, Facebook expects a rapid increase in how often machine learning is used in its existing and new services. This growth will, of course, pose even tougher challenges for the teams responsible for these service architectures at global scale. While optimizing the infrastructure on existing platforms is a significant opportunity, we are also actively evaluating and exploring new hardware solutions while keeping a focus on algorithmic innovation.

The main contents of this article (Facebook's view of machine learning) include:

Machine learning is being widely used in almost all of Facebook's services, and computer vision is only a small part of the resource requirements.

Facebook requires a wide and diverse set of machine learning algorithms, including but not limited to neural networks.

Our machine learning pipeline is processing massive amounts of data, and this brings engineering and efficiency challenges beyond compute nodes.

Facebook's inference workloads currently rely mainly on CPUs, while training relies on both CPUs and GPUs. However, from a performance-per-watt perspective, new hardware solutions should be continuously explored and evaluated.

Users around the world engage from hundreds of millions of devices each day, and the resulting daily load cycle leaves large numbers of machines available for machine learning tasks such as large-scale distributed training.

Facebook machine learning

Machine learning (ML), as used here, refers to applications that consume a series of inputs to build a tunable model and then use that model to produce a representation, a prediction, or some other form of useful signal.


Figure 1. Example of Facebook's machine learning process and architecture

The flow shown in Figure 1 alternates between the following two phases:

A training phase that builds the model. This phase is usually run offline.

An inference phase that runs the trained model inside an application and performs (a set of) real-time predictions. This phase runs online.

Models are trained far less frequently than they are used for inference; the training cadence varies by service, but it is usually on the order of days. Training also takes a relatively long time to complete, typically hours or days. Depending on the product, the online inference phase may run hundreds of thousands of times per day and generally must execute in real time. In some cases, particularly recommendation systems, additional training is also performed online in a continuous fashion.

A notable feature of machine learning at Facebook is the sheer volume of data available for model training. The scale of this data has implications across the entire machine learning architecture.

Main services using machine learning

News Feed

The News Feed ranking algorithm helps users see the stories that matter most to them each time they visit Facebook. General models are trained to determine the various user and environmental factors that should influence how content is ordered. Later, when a user visits Facebook, the model generates the best personalized feed from thousands of candidates — images and other content — along with the best ordering of the selected content.

Advertising

The advertising system uses machine learning to determine which ads to show a particular user. Ad models are trained on user attributes, user context, previous interactions, and advertisement attributes in order to predict the ads a user is most likely to click. Later, when the user visits Facebook, the inputs are run through the trained model to determine immediately which ads to display.

Search

Search launches a series of specific sub-searches for various vertical types (e.g., videos, photos, people, events). A classifier layer runs ahead of the vertical searches to predict which verticals are worth searching; otherwise those vertical searches would be wasted work. Both the classifier itself and the various vertical searches have an offline training phase and an online phase that runs the models to perform classification and search.

Sigma

Sigma is a general framework for classification and anomaly detection for monitoring a variety of internal applications, including site integrity, spam detection, payment, registration, unauthorized employee access, and event recommendations. Sigma contains hundreds of different models that run every day in production, and each model is trained to detect anomalies or more generally classify content.

Lumos

Lumos extracts high-level attributes and embeddings from an image and its contents so that algorithms can automatically understand it. This output can be used as input to other products and services, for example in the form of text.

Facer

Facer is Facebook's face detection and recognition framework. Given an image, it first finds all of the faces in it. It then runs a face recognition algorithm for a particular user to determine whether a detected face belongs to one of that user's friends. Facebook uses this service to suggest friends the user might want to tag in photos.

language translation

Language translation is the service behind the international exchange of Facebook content. Facebook supports translation to or from more than 45 languages, which means Facebook supports more than 2,000 translation directions, such as English to Spanish or Arabic to English. Across these 2,000-plus directions, Facebook serves roughly 4.5 billion translations every day. By translating users' posts, Facebook helps break down language barriers for 600 million people worldwide. Currently, each language direction has its own model, but we are also considering multilingual models [6].

Speech Recognition

Speech recognition is a service that converts audio streams into text. It can automatically generate captions for video. At present most of the usage covers English-language media, but recognition of other languages will be supported in the future. In addition, non-speech audio events can be detected with a similar (simpler) system.

Beyond the major products mentioned above, hundreds of long-tail services across Facebook's products also take advantage of machine learning in various forms.

Machine learning model

All machine learning-based services use "features" (or inputs) to produce quantified outputs. The machine learning algorithms used by Facebook include logistic regression (LR), support vector machines (SVM), gradient boosted decision trees (GBDT), and deep neural networks (DNN).

LR and SVM are very efficient to train and to use for prediction. GBDT can improve accuracy at the cost of additional compute resources. DNNs are the most expressive and deliver the highest accuracy, but they consume the most resources (in terms of computational complexity, at least an order of magnitude more than linear models such as LR and SVM).

Each of these model types comes with a growing number of free parameters that must be optimized using labeled input examples to maximize prediction accuracy.

Among deep neural networks, three classes are used most frequently: multilayer perceptrons (MLP), convolutional neural networks (CNN), and recurrent neural networks (RNN/LSTM). MLPs typically operate on structured input features (often for ranking), RNN/LSTM networks are typically used for temporal data, i.e., as sequence processors (often for language processing), while CNNs are the tool of choice for spatial data (often image processing). Table I shows the mapping between these model types and products/services.


Table I. Products or services that use machine learning algorithms
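As a rough illustration of the MLP family used for ranking-style workloads described above, here is a minimal PyTorch sketch; the layer widths and feature dimension are illustrative assumptions, not Facebook's actual models.

```python
import torch
import torch.nn as nn

class RankingMLP(nn.Module):
    """Minimal MLP over dense structured features, as used in ranking-style workloads."""
    def __init__(self, num_features: int = 256, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one relevance score per candidate
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# Score a batch of 1024 candidates, each described by 256 dense features.
model = RankingMLP()
scores = model(torch.randn(1024, 256))
print(scores.shape)
```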

ML-as-a-Service in Facebook

To simplify the task of applying machine learning to our products, we built some internal platforms and toolkits, including FBLearner, Caffe2 and PyTorch. FBLearner is a suite of three tools (FBLearner Feature Store, FBLearner Flow, FBLearner Predictor), each of which is responsible for the different parts of the machine learning pipeline. As shown in Figure 1 above, it utilizes an internal job scheduler to allocate resources and schedule jobs across the shared resource pool of the GPU and CPU. The training process for most of the machine learning models on Facebook is done on the FBLearner platform. These tools and platforms are designed to help machine learning engineers increase efficiency and focus on algorithm innovation.

FBLearner Feature Store. The starting point for any machine learning modeling task is collecting and generating features. The FBLearner Feature Store is essentially a catalog of feature generators that can be used both for training and for real-time prediction; it also serves as a marketplace where multiple teams can share and discover features. It is a good starting point for teams that are new to machine learning, and it also helps apply new features to existing models.

FBLearner Flow is the machine learning platform Facebook uses to train models. Flow is a pipeline management system that executes a workflow describing the steps required to train and/or evaluate a model, along with the resources each step needs. A workflow is built from discrete units, or operators, each with inputs and outputs. Connections between operators are inferred automatically by tracking the flow of data from one operator to the next, and Flow executes the workflow while handling scheduling and resource management. Flow also provides experiment management tooling and a simple user interface that tracks all the artifacts and metrics generated by each workflow or experiment, making it easy to compare and manage experiments.
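FBLearner Flow is internal to Facebook and its API is not public, so the following is only a hypothetical sketch of the operator/workflow idea just described — discrete operators with named inputs and outputs, wired into a DAG by following the data flow — not the real interface.

```python
# Hypothetical illustration of an operator-based workflow; this is NOT the real
# FBLearner Flow API, which is internal to Facebook.
from typing import Any, Callable, Dict, Optional

class Operator:
    """A discrete unit of work with named inputs and a single output."""
    def __init__(self, name: str, fn: Callable[..., Any],
                 inputs: Optional[Dict[str, "Operator"]] = None):
        self.name = name
        self.fn = fn
        self.inputs = inputs or {}

    def run(self, cache: Dict[str, Any]) -> Any:
        # Resolve upstream operators first; connections are inferred from the inputs.
        if self.name not in cache:
            args = {k: op.run(cache) for k, op in self.inputs.items()}
            cache[self.name] = self.fn(**args)
        return cache[self.name]

# Wire a toy workflow: load features -> train -> evaluate.
load = Operator("load_features", lambda: [[0.1, 0.2], [0.3, 0.4]])
train = Operator("train_model", lambda data: {"weights": [sum(row) for row in data]},
                 {"data": load})
evaluate = Operator("evaluate", lambda model: len(model["weights"]), {"model": train})

print(evaluate.run({}))  # executes the DAG in dependency order
```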

FBLearner Predictor is Facebook's internal inference engine, which serves real-time predictions from models trained in Flow. Predictor can be used as a multi-tenant service or as a library integrated into a specific product's backend services. Many Facebook product teams use Predictor, and many of them need low-latency solutions. The direct integration between Flow and Predictor also helps run online experiments and manage multiple versions of a model in production.

Deep learning framework

We use two distinct but complementary frameworks for deep learning at Facebook: PyTorch, optimized for research, and Caffe2, optimized for production.

Caffe2 is Facebook's in-house production framework for training and deploying large-scale machine learning models. Caffe2 focuses on the key features products require: performance, cross-platform support, and coverage of fundamental machine learning algorithms such as convolutional neural networks (CNN), recurrent neural networks (RNN), and multilayer perceptrons (MLP). These networks have sparse or dense connections and up to tens of billions of parameters. The framework is designed in a modular way, with a unified graph representation shared across all backend implementations (CPU, GPU, and accelerators). To achieve the best runtime on different platforms, Caffe2 also abstracts third-party libraries, including cuDNN, MKL, and Metal.

PyTorch is the framework of choice for AI research at Facebook. Its front end emphasizes flexibility, debuggability, and dynamic neural networks, which enables rapid experimentation. Because it relies on Python for execution, however, it is not optimized for production or mobile deployment. When a research project produces valuable results, its models need to be transferred to production. In the past we did this by rewriting the training pipeline against other frameworks for the production environment; more recently, Facebook began building the ONNX toolchain to simplify this transfer. For example, dynamic neural networks are useful for cutting-edge AI research, but such models take longer to bring to products. By decoupling the frameworks, we avoid having to design a more complex execution engine (such as Caffe2's) just to satisfy research needs. In addition, researchers value flexibility over raw speed while doing research: in the model exploration phase, a 30% performance penalty can be tolerated in exchange for easy experimentation and model visualization, but the same trade-off is not acceptable in production. This principle is visible in the designs of PyTorch and Caffe2: PyTorch provides good defaults and reasonable performance, while Caffe2 can employ asynchronous graph execution, quantized weights, and multiple specialized backends to achieve the best performance.

Although the FBLearner platform itself does not restrict which framework is used — Caffe2, TensorFlow, PyTorch, or anything else — our AI Software Platform team has specifically optimized FBLearner to integrate well with Caffe2. Overall, separating the research and production frameworks (PyTorch and Caffe2, respectively) gives us flexibility on both sides, reducing constraints while we add new features.

ONNX. Industry-wide, the deep learning tools ecosystem is still in its infancy. Different tools are better for different subsets of problems and make different trade-offs in flexibility, performance, and supported platforms, much like the trade-offs we described for PyTorch and Caffe2. As a result, there is a strong need to exchange trained models between frameworks and platforms. To fill this gap, in late 2017 Facebook and several partners launched the Open Neural Network Exchange (ONNX). ONNX is a format for representing deep learning models in a standard way, enabling interoperability between different frameworks and vendor-optimized libraries, and it directly addresses the need to exchange trained models across frameworks and platforms. ONNX is designed as an open specification so that framework authors and hardware vendors can contribute to it, and a variety of converters exist between frameworks and libraries. Facebook is working to make ONNX a point of collaboration among all of these tools rather than an exclusive official standard.


Within Facebook, ONNX is the primary means by which we move research models from the PyTorch environment to the high-performance Caffe2 production environment. ONNX provides the ability to automatically capture and convert the static parts of a model. We have an additional toolchain that maps the dynamic-graph parts of a Python model to Caffe2, either by mapping them to control-flow primitives or by reimplementing them in C++ as custom operators.
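As a concrete (if simplified) example of the static-graph capture step, the sketch below exports a toy PyTorch model to an ONNX file using the public torch.onnx.export API; the model, shapes, and file name are placeholders, and loading the result into a Caffe2/production runtime is only referenced in a comment.

```python
import torch
import torch.nn as nn

# A toy model standing in for a research model developed in PyTorch.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()

# An example input fixes the shapes of the traced (static) graph.
dummy_input = torch.randn(1, 128)
torch.onnx.export(model, dummy_input, "ranking_model.onnx",
                  input_names=["features"], output_names=["score"])

# The resulting .onnx file can then be loaded by a production runtime,
# e.g., Caffe2's ONNX backend in the setup described above.
```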

Resource requirements for machine learning

Given the different resource requirements, frequency, and duration of machine learning during the training and inference phases, we will discuss the details and resource applications of these two phases separately.

Facebook hardware resources overview

Facebook's infrastructure organization has long built efficient platforms for its major software services, including customized server, storage, and network designs matched to the resource requirements of each major workload.


Figure 2. CPU-based compute server. A 2U chassis holds three compute sleds; a single-socket sled base carries four Monolake server cards, while a two-socket sled base carries one two-socket server. The 2U chassis can therefore hold twelve single-socket servers or three two-socket servers.

Currently, Facebook offers roughly eight major compute and storage architectures, corresponding to its eight major services, and these are sufficient to cover the resource requirements of Facebook's main workloads. For example, Figure 2 shows a 2U chassis that accommodates three compute sleds supporting two server types. One sled type is a single-socket CPU server (1xCPU), used mostly in the web tier — a stateless, throughput-oriented service — which can therefore use a more power-efficient CPU (Broadwell-D), relatively little DRAM (32GB), and minimal on-board disk or flash storage.

The other sled type is a larger two-socket CPU server (2x high-power Broadwell-EP or Skylake SP CPUs) with a large amount of DRAM, used for services that involve heavy computation and storage.


Figure 3. Big Basin GPU server with 8 GPUs (3U rack)

As the neural networks we train grow larger and deeper, we developed the Big Basin GPU server (shown in Figure 3), our latest GPU server as of 2017. The original Big Basin was equipped with eight interconnected NVIDIA Tesla P100 GPU accelerators, using NVIDIA NVLink to form an eight-GPU hybrid cube mesh; the design was later refined for the V100 GPU.

Big Basin is the successor to the earlier Big Sur GPU server, Facebook's first widely deployed high-performance AI compute platform in its data centers, developed in 2015 to support NVIDIA M40 GPUs and released through the Open Compute Project.

Compared to Big Sur, the V100-based Big Basin delivers higher performance per watt, thanks to single-precision floating-point throughput that rises from 7 teraflops to 15.7 teraflops per GPU and high-bandwidth memory (HBM2) providing 900GB/s of bandwidth. The new architecture also doubles half-precision throughput, further increasing computational performance.

Because Big Basin offers greater computational throughput and its memory grows from 12GB to 16GB, it can be used to train models 30% larger than before. The high-bandwidth NVLink interconnect also enhances GPU-to-GPU communication for distributed training. In tests with the ResNet-50 image classification model, Big Basin delivered 300% higher throughput than Big Sur, allowing us to train more complex models faster than ever.

Facebook announced the design of all of these computing servers and several storage platforms through the Open Compute Project.

Resource requirements for offline training

Currently, different products use different compute resources for their offline training. Some products (such as Lumos) do all of their training on GPUs. Others, such as Sigma, do all of their training on dual-socket CPU compute servers. Products such as Facer use a two-stage training process: a general face detection and recognition model is trained on GPUs at low frequency (monthly), and then user-specific models are trained at very high frequency on thousands of 1xCPU servers.

In this section, we focus on each service's training platform, training frequency, and training duration, summarized in Table II. We also discuss trends in data set size and the implications of those trends for compute, memory, storage, and network architecture.

Compute type and location relative to the data source. Offline training can be done either on CPUs or on GPUs, depending on the service. Although models trained on GPUs usually outperform models trained on CPUs, the sheer amount of readily available CPU capacity makes CPUs a very useful platform. This is especially true during the off-peak portion of the daily cycle, when CPU resources would otherwise sit idle, as illustrated in Figure 4 below. The correspondence between services and the compute resources used to train their models is as follows:

Services that train models on GPUs: Lumos, speech recognition, language translation

Services that train models on CPUs: News Feed, Sigma

Services that train models on both GPUs and CPUs: Facer (the general model is trained on GPUs only every few years, as it is relatively stable, while user-specific models trained on 1xCPU servers handle new image data) and Search (which uses multiple independent vertical search engines, launched by a classifier that predicts the most appropriate verticals).

Currently, GPUs are used primarily for offline training rather than for serving real-time results to users, because most GPU architectures are optimized for computational throughput at the expense of latency. At the same time, since training relies heavily on data pulled from large data warehouses, GPUs must sit close to the data source for performance and bandwidth reasons. As the amount of data used to train models grows rapidly, keeping GPUs close to the data source becomes increasingly important.

Memory, storage, and network: In terms of memory capacity, both the CPU and GPU platforms provide enough memory for training. Even applications like Facer can train their user-specific SVM models within the 32GB of RAM on a 1xCPU server. Making the most efficient use of the platforms and of their spare capacity yields good overall training efficiency.

Service              | Training resource          | Training frequency   | Training duration
News Feed            | Single-socket CPUs         | Once a day           | Hours
Facer                | GPUs + single-socket CPUs  | Every N photos       | Few seconds
Lumos                | GPUs                       | Every few months     | Hours
Search               | Vertical-dependent         | Once per hour        | Hours
Language translation | GPUs                       | Once a week          | Days
Sigma                | Dual-socket CPUs           | Several times a day  | Hours
Speech recognition   | GPUs                       | Once a week          | Hours

Table II. Frequency, duration, and resources used for offline training of different services

Machine learning systems rely on training with example data, and Facebook moves a great deal of data through its machine learning pipelines. This creates pressure to place compute resources close to the databases.

Over time, most services show a trend toward exploiting accumulated user data, which makes them more dependent on other Facebook services and requires more network bandwidth to fetch data. Large storage is therefore deployed only at or near the data sources, so that data can be transferred at scale from remote locations without stalling the training pipelines while waiting for more example data.

We also consider these factors when choosing where to place training machines, to avoid putting excessive stress on the resources of the training fleet.

The amount of data used during offline training varies greatly from service to service, and the training data set of almost every service shows steady or even dramatic growth. For example, some services see diminishing returns beyond millions of rows of data, while others use tens of billions of rows (more than 100 terabytes) and are limited only by the resources available.

Scaling considerations and distributed training: Training a neural network involves optimizing parameter weights with stochastic gradient descent (SGD). This method fits the network by iteratively updating the weights based on the evaluation of small subsets of labeled examples (a "batch" or "mini-batch"). In data parallelism, the network spawns multiple model replicas (parallel instances) that process multiple batches of data in parallel.
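A minimal sketch of the mini-batch SGD loop just described, with a single model replica; the synthetic data, model, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 1)                      # a single model replica
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Synthetic labeled examples standing in for real training data.
features = torch.randn(4096, 32)
labels = torch.randn(4096, 1)

batch_size = 256
for start in range(0, len(features), batch_size):
    x = features[start:start + batch_size]    # one mini-batch
    y = labels[start:start + batch_size]
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                           # evaluate gradients on the mini-batch
    optimizer.step()                          # update the parameter weights
```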

When training on a single machine, larger or deeper models tend to give better results and higher accuracy, but they also require processing more examples. On a single machine, we can speed training up by increasing the number of model replicas and running data-parallel across multiple GPUs.

As the amount of data required for training grows over time, hardware limits would otherwise increase total training latency and time to convergence. Distributed training can overcome these hardware limitations and reduce latency, and it is a very active area of research both at Facebook and across the AI community.

A common assumption is that data parallelism across machines requires a specialized interconnect. In our work on distributed training, however, we have found that Ethernet-based networking can provide near-linear scaling. Whether near-linear scaling is achievable depends closely on the model size and the network bandwidth.

If the network bandwidth is too low, parameter synchronization takes longer than the gradient computation itself and the benefits of data parallelism across machines largely disappear. With a 50G Ethernet NIC, we can scale out training of vision models on Big Basin servers without machine-to-machine synchronization becoming a problem at all.
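A back-of-envelope estimate of when gradient synchronization starts to dominate; the 50 Gb/s NIC figure comes from the text, while the model size and per-step compute time are illustrative assumptions.

```python
# Rough estimate: data parallelism pays off only if synchronizing gradients is
# cheaper than computing them. All numbers below are illustrative assumptions.
params = 25_000_000            # e.g., a ResNet-50-sized model (~25M parameters)
bytes_per_param = 4            # FP32 gradients
nic_bandwidth = 50e9 / 8       # 50 Gb/s NIC expressed in bytes per second

sync_time = 2 * params * bytes_per_param / nic_bandwidth   # send + receive, no overlap
compute_time_per_step = 0.2    # assumed seconds of forward/backward per mini-batch

print(f"sync ~{sync_time * 1e3:.0f} ms vs compute ~{compute_time_per_step * 1e3:.0f} ms per step")
# If sync_time approaches compute_time_per_step, scaling stops being near-linear.
```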

In all cases, updates must be shared with the other replicas using techniques that balance synchronization (every replica sees the same state), consistency (every replica generates correct updates), and performance (scaling whose cost stays sublinear), and these choices can affect training quality. For example, the translation service currently cannot train with very large mini-batches without degrading model quality.

Conversely, with specific hyperparameter settings we can train image classification models with very large mini-batches and scale to more than 256 GPUs.
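One widely cited recipe for such large mini-batch training — from Facebook's own large-minibatch ImageNet work — scales the learning rate linearly with the batch size and warms it up over the first few epochs. The sketch below illustrates that schedule with assumed values; it is not the exact production configuration.

```python
# Illustrative large mini-batch learning-rate schedule: linear scaling with warmup.
# Base values are assumptions, not exact production settings.
base_lr = 0.1          # reference learning rate for a batch of 256
base_batch = 256
global_batch = 8192    # e.g., 32 examples per GPU across 256 GPUs
warmup_epochs = 5

target_lr = base_lr * global_batch / base_batch   # linear scaling rule

def lr_at(epoch: float) -> float:
    if epoch < warmup_epochs:
        # Ramp gradually from base_lr to target_lr to keep early updates stable.
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr

for e in (0, 1, 2, 5, 10):
    print(e, round(lr_at(e), 3))
```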

Experiments on one large Facebook service have shown that data parallelism across 5x the machines yields 4x the training throughput. For example, for a group of models that each take more than four days to train, a cluster that previously trained 100 different models now trains only 20 models per day — a 20% loss in training efficiency — but the potential engineering wait time drops from four days to one.

If a model becomes extremely large, model parallelism can be used instead: the layers of the model are grouped and distributed across machines to optimize training efficiency, with activations passed between machines. The optimization may be bound by network bandwidth, latency, or the balance of per-machine limits. This increases the end-to-end latency of each training step, so the raw per-step speedup usually comes with a reduction in step quality. The per-step loss of accuracy accumulates, which yields an optimal number of machines over which to parallelize.

DNN models themselves are usually designed so that they can run on a single machine; during inference, splitting a model graph across machines tends to cause a large amount of machine-to-machine communication. Facebook's major services nonetheless constantly weigh the pros and cons of scaling models up, and these considerations determine changes in network capacity requirements.

Service              | Relative compute capacity | Compute platform   | RAM
News Feed            | 100X                      | Dual-socket CPU    | High
Facer                | 10X                       | Single-socket CPU  | Low
Lumos                | 10X                       | Single-socket CPU  | Low
Search               | 10X                       | Dual-socket CPU    | High
Language translation | 1X                        | Dual-socket CPU    | High
Sigma                | 1X                        | Dual-socket CPU    | High
Speech recognition   | 1X                        | Dual-socket CPU    | High

Table III. Resource requirements of online inference services

Resource requirements for online inference

In the online inference step that follows offline training, we load the model onto a machine and run it against real-time inputs to generate real-time results for live website traffic.

Next we discuss one online inference model in practice: the ad ranking model. This model screens thousands of candidate ads and displays the top one to five of them in News Feed. It does so with progressively more complex ranking passes over successively smaller subsets of the ads.

Each pass uses a model similar to a multilayer perceptron (MLP) that includes a sparse embedding layer, and each pass reduces the number of candidate ads. The sparse embedding layer is memory intensive, so for the later passes, whose models have more parameters, it runs on a server separate from the MLP computation.
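A minimal sketch, in the spirit of the model just described, of a ranking network whose sparse categorical features pass through a memory-heavy embedding layer before joining dense features in an MLP; the vocabulary size, dimensions, and feature layout are assumptions.

```python
import torch
import torch.nn as nn

class SparseDenseRanker(nn.Module):
    """Sparse categorical IDs go through a large embedding table; dense features feed an MLP."""
    def __init__(self, num_ids: int = 1_000_000, embed_dim: int = 32, num_dense: int = 64):
        super().__init__()
        # The embedding table is the memory-heavy, sparse part of the model.
        self.embedding = nn.EmbeddingBag(num_ids, embed_dim, mode="sum")
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + num_dense, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, sparse_ids, offsets, dense):
        pooled = self.embedding(sparse_ids, offsets)          # one pooled vector per example
        return torch.sigmoid(self.mlp(torch.cat([pooled, dense], dim=1)))

model = SparseDenseRanker()
ids = torch.tensor([3, 17, 42, 7])        # sparse feature IDs for a batch of 2 examples
offsets = torch.tensor([0, 2])            # example 0 owns ids[0:2], example 1 owns ids[2:4]
dense = torch.randn(2, 64)
print(model(ids, offsets, dense).shape)   # predicted click probability per candidate
```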

From a compute standpoint, the vast majority of online inference runs on large fleets of 1xCPU (single-socket) or 2xCPU (dual-socket) servers. Because 1xCPU servers are more power- and cost-efficient for Facebook's services, Facebook advocates serving models from 1xCPU servers whenever possible.

With the advent of high-performance mobile hardware, Facebook can even run certain models directly on users' mobile devices to improve latency and reduce communication costs. However, some services that need substantial compute and memory resources still require 2xCPU servers for the best performance.

Different products have different latency requirements for online inference. In some cases, the result may be "good enough" to return as a preliminary quick assessment to the user and then be fed back into the model for re-evaluation. For example, it may be acceptable to initially classify a piece of content as compliant, knowing that this preliminary classification can be overturned when a more complex model is run.

Models such as ad ranking and News Feed ranking have firm SLAs for delivering the right content to users. These SLAs bound the complexity and the dependencies of the models, so with more compute capability we could deploy even more advanced models.

Machine learning at data center scale

Beyond raw resource requirements, several other important factors come into play when deploying machine learning in the data center, including the need to keep data close to compute and resilience to disasters.

Getting data to the models

For many of Facebook's machine learning models, the main factor in their success is the availability of extensive, high-quality data. The ability to process this data quickly and deliver it to the models is what allows us to run fast and efficient offline training.

For complex machine learning applications, such as ads ranking, the amount of data needed for each training task exceeds hundreds of terabytes. In addition, complex preprocessing logic is used to clean and normalize the data for efficient transfer and easier learning. These operations place very high demands on resources, particularly storage, network, and CPU.

As a general solution, we try to decouple the data workload from the training workload, because the two have very different characteristics. The data workload is complex, ad hoc, business-dependent, and changes quickly. The training workload, in contrast, is usually regular (e.g., GEMM), stable (a relatively small set of core operations), highly optimized, and much happier in a "clean" environment (e.g., exclusive cache usage and minimal thread contention).

To optimize for both, we physically separate the two workloads onto different machines. The data processing machines, called "readers", read data from storage, process and compress it, and then feed the results to the training machines, called "trainers". The trainers, in turn, focus exclusively on executing training quickly and efficiently. Readers and trainers can be distributed to provide greater flexibility and scalability, and we also tune the machine configurations for each type of workload.
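A minimal sketch of the readers-feed-trainers decoupling described above, using a bounded in-process queue to stand in for the network transport; the preprocessing and training steps are placeholders.

```python
import queue
import threading

batches = queue.Queue(maxsize=8)   # bounded buffer between readers and trainers

def reader(num_batches: int):
    """Reader machines: fetch raw data, clean/compress it, and hand batches to trainers."""
    for i in range(num_batches):
        raw = list(range(i * 4, i * 4 + 4))      # stand-in for data read from storage
        processed = [x * 0.5 for x in raw]       # stand-in for preprocessing/compression
        batches.put(processed)
    batches.put(None)                            # signal end of stream

def trainer():
    """Trainer machines: run the tight, optimized training loop on prepared batches only."""
    while (batch := batches.get()) is not None:
        loss = sum(batch)                        # stand-in for one training step
        print("step loss:", loss)

threading.Thread(target=reader, args=(3,)).start()
trainer()
```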

Another important optimization target is network usage. The data traffic generated by training is significant and sometimes bursty. Without intelligent handling it can easily saturate network equipment and even interfere with other services. To address this we use compression, scheduling algorithms, data/compute placement, and other techniques.

Figure 4. Server utilization over the daily load cycle; large amounts of compute sit idle during off-peak periods.

Leveraging scale

As a global company serving users around the clock, Facebook must maintain a server fleet large enough to handle peak workload at any time. As the figure shows, because user activity varies with the daily load cycle and peaks during special events (such as regional holidays), a large number of servers are typically idle during particular periods.

This frees up a large pool of compute resources during off-peak hours. Allocating these heterogeneous resources elastically and sensibly across tasks is a major opportunity that Facebook is actively exploring. For machine learning applications, it offers the chance to apply scalable distributed training to a large pool of heterogeneous resources (for example, CPU and GPU platforms with different RAM configurations). However, it also brings challenges: the huge amount of compute available during these low-utilization periods fundamentally changes how distributed training must be approached.

The scheduler must first balance load correctly across heterogeneous hardware, so that no host waits on others purely for the sake of synchronization. When training spans multiple hosts, the scheduler must also account for network topology and the cost of synchronization. Handled poorly, intra-rack or inter-rack synchronization traffic can become very large, dramatically reducing training speed and quality.

For example, in the 1xCPU design, four 1xCPU hosts share a 50G NIC. If all four hosts try to synchronize gradients with other hosts at the same time, the shared NIC quickly becomes a bottleneck, leading to dropped packets and request timeouts. The network topology and the scheduler therefore need to be co-designed so that idle servers can be used efficiently during off-peak hours. In addition, such algorithms must be able to checkpoint, stop, and restart training as the load changes.

Disaster recovery capability

Seamlessly tolerating the loss of part of Facebook's global compute, storage, and network footprint has long been a goal of Facebook's infrastructure. Facebook's disaster recovery team regularly runs internal drills to identify and remediate the weakest links in the global infrastructure and software stacks. These exercises include taking entire data centers offline with little or no notice, to confirm that the loss of any of our global data centers causes minimal disruption to the business.

For both the training and inference sides of machine learning, the importance of disaster readiness is self-evident. While it is unsurprising that inference powers several key products, the discovery that several key products also depend on frequent training — and would show measurable degradation without it — came as a surprise.

The following discussion highlights the importance of frequent machine learning training for three major products, the infrastructure support needed to accommodate that frequent training, and how all of this ties into disaster recovery.

What happens if we don't train the models? We analyzed three key services that rely on machine learning training — ads, News Feed, and community integrity — to determine the impact of being unable to frequently retrain their models. Our goal was to understand the consequences of losing the ability to train for one week, one month, and six months.

The first obvious impact is engineer productivity, since machine learning progress is usually tied to frequent experimentation. Although many models can be trained on CPUs, training on GPUs often yields noticeably better model performance. These speedups shorten model iteration time and make it possible to explore more ideas, so losing the GPUs would directly reduce these engineers' productivity.

Beyond that, we identified significant impact on Facebook products, especially products that refresh their models frequently. We summarize below the problems that arise when these products run on stale models.

Community integrity: Creating a safe place for people to share and connect is Facebook's core mission, and rapidly and accurately detecting offensive content is central to that mission. Our community integrity team relies heavily on machine learning to detect offensive text, images, and video. Offensive-content detection is a specialized form of spam detection: adversaries constantly look for new, creative ways to bypass our classifiers and surface objectionable content to our users. Facebook frequently trains models to learn these new patterns, and each training iteration takes several days to produce an accurate model for detecting objectionable images. We continue to push the boundaries of distributed training to train models faster, but incomplete training would degrade the quality of detection.

News Feed: Unsurprisingly, products like News Feed depend heavily on machine learning and frequent model training. Determining the most relevant content for every user on every visit relies on advanced machine learning algorithms that find and rank content correctly. Unlike some other products, Feed ranking learns in two steps: an offline step trains the best model on CPUs/GPUs, and an online step continues training on CPUs. A stale News Feed model has a measurable impact on feed quality. The News Feed team constantly innovates on its ranking models and runs hours-long continuous training against the current models to keep improving them. Losing a data center for a week, and with it the associated training computation, could stall the team's progress in exploring new models and new parameters.

Ads: Most surprising was the importance of frequent training for the ad ranking models. Finding and showing the right ads depends enormously on machine learning and on continual innovation in it. To underline this dependence: we learned that the cost of using a stale model is measured in hours. In other words, using a day-old model is much worse than using a model trained just an hour earlier.

Overall, our investigation underscores how important machine learning training is to many Facebook products and services. Disaster readiness for this growing workload should not be underestimated.

Infrastructure support for disaster recovery: The figure above shows how Facebook's data center infrastructure is distributed around the globe. If we focus on CPU availability for training and inference, we have enough spare compute capacity to absorb the loss of servers in nearly every region. However, the importance of providing equivalent redundancy for GPU resources was initially underestimated. The initial workloads that used GPUs for training were mainly computer vision applications, and the data needed to train those models was replicated globally. When GPUs were first deployed into Facebook's infrastructure, consolidating them in a single region seemed a sensible choice for manageability until the designs matured and we could build in-house expertise in their service and maintenance requirements. The combined result of these two factors was that we physically isolated all production GPUs in a single data center region.

Several key changes followed, however. As more and more products adopted deep learning — including ranking, recommendation, and content understanding — the importance of GPU compute and of co-located big data grew. Moreover, the push to co-locate compute with data was compounded by a move toward a mega-region storage strategy: the mega-region concept means that a small number of data center regions hold most of Facebook's data. And, as it happened, the region hosting the entire GPU fleet was not one of the large storage regions.

Therefore, beyond the importance of co-locating data with compute, it became critical to consider what would happen if we lost the entire region hosting the GPUs. The outcome of that consideration drove us to diversify the physical placement of the GPUs used for machine learning training.

Future design directions: hardware, software, and algorithms

As model complexity and data set sizes grow, the computational demands of machine learning grow with them. Machine learning workloads exhibit many algorithmic and numerical properties that influence hardware choices.

We know that convolutions and medium-sized matrix multiplications are the key computational kernels of deep learning's forward and backward passes. With larger batch sizes, each parameter weight is reused more often, so these kernels improve in arithmetic intensity (the number of compute operations per byte of memory accessed). Higher arithmetic intensity generally improves the efficiency of the underlying hardware, so within the latency constraints it is desirable to run with larger batches. These compute-bound machine learning workloads benefit from wider SIMD units, dedicated convolution or matrix-multiplication engines, and dedicated co-processors.
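A small calculation of the arithmetic intensity of a GEMM illustrates why larger batches reuse each weight more often; the matrix dimensions are assumptions.

```python
# Arithmetic intensity (FLOPs per byte) of a GEMM C[M,N] = A[M,K] @ B[K,N],
# where M is the batch size and B holds the layer's weights. Sizes are assumptions.
def gemm_intensity(m: int, k: int, n: int, bytes_per_elem: int = 4) -> float:
    flops = 2.0 * m * k * n                                   # multiply-accumulate count
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)    # read A and B, write C
    return flops / bytes_moved

for batch in (1, 16, 256):
    print(batch, round(gemm_intensity(batch, 1024, 1024), 1))
# Intensity grows with batch size: each weight fetched from memory is reused
# across more examples, pushing the kernel closer to the hardware's compute limit.
```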

In some cases, small per-node batch sizes are also required — for real-time inference with low concurrent query counts, or when training scales out to very large numbers of nodes. Small batch sizes usually have lower arithmetic intensity (for example, the matrix-vector multiplications of fully connected layers are inherently bandwidth-bound). Small batches can also hurt performance in several common cases when the full model does not fit in on-chip SRAM or the last-level cache.

This problem can be mitigated by model compression, quantization, and high-bandwidth memory. Model compression can be achieved through sparsification and/or quantization. Pruning connections during training (sparsification) produces a smaller model. Quantization compresses the model with fixed-point integers or narrower floating-point formats instead of FP32 (single precision) for the weights and activations. Several common networks have been shown to deliver comparable accuracy with 8-bit or 16-bit representations, and there is ongoing research into compressing model weights down to 1 or 2 bits. Besides reducing the memory footprint, pruning and quantization speed up the underlying hardware by reducing bandwidth demands, and they let hardware architectures reach higher compute rates when operating on fixed-point values, which is more efficient than processing FP32.
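As one concrete instance of the weight-quantization idea, the sketch below applies PyTorch's post-training dynamic quantization to the linear layers of a toy model; the model is a placeholder and the exact API location can vary across PyTorch versions.

```python
import torch
import torch.nn as nn

# A toy FP32 model standing in for a trained network.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2))
model.eval()

# Post-training dynamic quantization: Linear-layer weights are stored as int8
# and dequantized on the fly, shrinking the model and its bandwidth needs.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
print(model(x), quantized(x))   # outputs should be close but not bit-identical
```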

Shortening training time and speeding up model delivery requires distributed training. As discussed in Section IV-B, distributed training requires careful co-design of network topology and scheduling to use hardware efficiently and achieve good training speed and quality. As described in Section III-B, the most widely used form of parallelism in distributed training is data parallelism, which requires gradient descent to be synchronized across all nodes, either synchronously or asynchronously. Synchronous SGD requires an all-reduce operation. An interesting property of the all-reduce, when executed with recursive doubling (and halving), is that the bandwidth requirement decreases exponentially with each level of recursion.

This encourages hierarchical system designs in which nodes at the bottom of the hierarchy form super-nodes with high-bandwidth connectivity (for example, high-bandwidth point-to-point links or a high-radix switch), while at the top of the hierarchy the super-nodes are connected by a slower network (for example, Ethernet). Asynchronous SGD (which does not wait for other nodes to process their batches), by contrast, is harder and is usually done through a shared parameter server: nodes send their updates to the parameter server, which aggregates them and distributes them back to the nodes. To reduce update staleness and relieve pressure on the parameter server, a hybrid design can be a good idea: asynchronous updates happen among local nodes within a super-node, where connectivity has high bandwidth and low latency, while synchronous updates happen between super-nodes. Pushing scalability further requires increasing the batch size without sacrificing convergence — an active area of algorithmic research both inside and outside Facebook.
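A small calculation of the per-node traffic of a bandwidth-optimal all-reduce (ring or recursive halving/doubling) shows why synchronous SGD can scale across many nodes and why the hierarchical design above helps; the gradient size and node counts are assumptions.

```python
# Per-node traffic of a bandwidth-optimal all-reduce (ring or recursive halving/doubling):
# each node sends and receives roughly 2*(p-1)/p of the gradient, nearly independent of p.
def allreduce_bytes_per_node(grad_bytes: float, nodes: int) -> float:
    return 2.0 * (nodes - 1) / nodes * grad_bytes

grad_bytes = 100e6   # assumed 100 MB of FP32 gradients
for p in (2, 8, 64, 256):
    mb = allreduce_bytes_per_node(grad_bytes, p) / 1e6
    print(f"{p:4d} nodes -> ~{mb:.0f} MB per node per step")
# Traffic per node plateaus near 2x the gradient size, so fast intra-super-node
# links plus slower Ethernet between super-nodes can keep synchronization cheap.
```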

As described in Section II, our mission at Facebook is to build high-performance, power-efficient machine learning systems for ML-based applications. We continually evaluate and prototype novel hardware solutions, while keeping an eye on algorithmic changes and their potential impact on system design.

Conclusion

The growing importance of machine-learning-based workloads is affecting every part of the system stack. How the computer architecture community can best respond, and the challenges this raises, will attract increasing attention. While a great deal of prior work has addressed efficiently handling the computation needed for ML training and inference, the picture changes once solutions must be applied at massive data scale.

At Facebook, we found several key factors that matter at scale and that drive the design of our data center infrastructure: the importance of co-locating data with compute, the importance of handling a wide variety of machine learning workloads (not just computer vision), and the opportunities created by spare capacity during the off-peak portions of the daily cycle. We considered each of these factors when designing end-to-end solutions that combine custom, easy-to-use open-source hardware with an open-source software ecosystem that balances performance and usability. These solutions now power large-scale machine learning workloads serving more than 2.1 billion people, and they reflect the effort of experts working across machine learning algorithms and system design.
