The days after last year's "data scandal" were not good ones for Facebook. Since then, the keywords the company has emphasized to the outside world have mostly been "privacy" and "security." Even so, at the recent Facebook F8 conference, Zuckerberg couldn't help joking at his own expense that many people still don't trust Facebook because of data privacy issues.
However, this is not just Facebook's problem: how to use existing technology such as AI to protect users from harm is a question every company must keep exploring. For a world-class company that has been through a year of turmoil, Facebook's efforts on data privacy and platform security are also plain to see.
Facebook CTO Mike Schroepfer and Facebook AI research scientist Manohar Paluri gave a keynote speech at the recent F8 conference, mainly about how AI technology helps keep users of the platform safe. Two things need to be done: 1. understand the content; 2. use self-supervised learning methods to improve the accuracy of content recognition and reduce the labeled-data requirements of applications such as translation, NLP, and image recognition.
Yann LeCun commented that this work helps filter offending content such as violent images, hate speech, election interference, misinformation, and bots.
Setting aside skepticism about Facebook, it is worth looking at what the company has done technically. Its AI practices may hold important lessons for other companies in protecting user data and user experience.
The specific technical details are in the full text of the speech below:
One of the most ubiquitous applications of AI at Facebook is helping users on our platform stay safe.
To make all of these systems more effective, we need to keep improving AI technology in two ways: understanding content, and working with small amounts of labeled training data.
Our recent progress in NLP and CV shows how work on content understanding yields benefits. In NLP, we developed a shared multilingual embedding space that serves as a common language for handling harmful content, even in resource-poor languages. In CV, building on industry-leading research, we can identify more of the content in an image, and by using hashtags we achieve record accuracy in video understanding.
As our ability to understand content keeps improving across different models, we have also made progress on the new frontier of self-supervision. This technology accelerates learning through pretrained systems and can become the underlying technology for the next generation of faster, more flexible tools.
We will focus on how Facebook improves the accuracy and efficiency of its content-understanding systems and finds new ways to do more with less supervised learning.
First, using multilingual sentence embeddings to handle offending content
To detect when people post offending content, our system needs to understand language. Specifically, it uses machine learning to scan a given sentence and answer a series of questions, such as "Is it hateful?" Based on the answers to these questions, the interaction context, and other signals, we determine whether the system should take action, such as flagging the content for a human reviewer.
For the ML system to answer these questions, we need thousands of training examples in a given language. There are approximately 6,500 languages in the world, many of which currently lack large training datasets, and finding enough examples to build a content-understanding system that supports all of them is a huge challenge.
By mapping similar sentences from multiple languages into a shared embedding space, we can better understand related content without translating each sentence.
To help address the scarcity of training data, we are using our recently open-sourced toolkit LASER (Language-Agnostic SEntence Representations) to understand a large number of languages by training a single model. Previously we needed a separate model for each language; LASER's representation space lets us train one language model and then apply it to a range of languages without language-specific training data or translation. This is called zero-shot transfer learning. LASER also lets us identify sentences that are similar in meaning by mapping them near each other in a language-agnostic representation space.
LASER open source address: https://github.com/facebookresearch/LASER
For researchers who want their systems to understand more languages, this cross-lingual technique provides a more scalable alternative to collecting and annotating data for every language. The approach also lets us mine parallel training data for machine translation, which is especially useful for low-resource languages where training examples are scarce. Identifying similar sentences across languages helps capture similar violations in multiple languages at the same time. To generate each sentence-level embedding, we first use byte-pair encoding to represent the words of a given sentence, then apply a five-layer bidirectional LSTM (long short-term memory) model followed by a max-pooling operation (since a sentence can contain any number of words).
By training this system at scale, on 93 languages from more than 30 language families written in 22 different scripts, we obtain language-independent sentence embeddings, and their ability to support automatic detection of violations is especially relevant for low-resource languages.
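The pooling step described above can be sketched in a few lines. This is a toy illustration only: the token vectors below are invented, whereas in LASER they would come from the BiLSTM encoder over BPE tokens, and real embeddings have 1,024 dimensions rather than 3.

```python
import math

def max_pool(token_vectors):
    """Element-wise max over a variable-length list of token vectors,
    producing one fixed-size sentence embedding."""
    return [max(dims) for dims in zip(*token_vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical encoder outputs for the same sentence in two languages,
# plus an unrelated sentence. Sentences can have different lengths;
# max pooling collapses each one to the same dimensionality.
en_tokens = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.1, 0.2, 0.7]]
fr_tokens = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.0], [0.0, 0.3, 0.8], [0.1, 0.1, 0.1]]
unrelated = [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]]

en_vec, fr_vec, other_vec = map(max_pool, (en_tokens, fr_tokens, unrelated))

# In a shared space, sentences with similar meaning land near each other,
# which is what lets one classifier serve many languages.
print(cosine(en_vec, fr_vec) > cosine(en_vec, other_vec))  # → True (toy data)
```

The key property is that similarity can be computed directly across languages, without translating either sentence first.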
This approach, along with our cross-lingual pretraining research, will strengthen our ability to handle hate speech, bullying, and other violations in multiple languages without additional labeled training data in those languages. Both techniques complement our existing multilingual word embeddings, which map similar words (as opposed to LASER's sentence-level mapping) from different languages into the same space. These embeddings have been deployed in production for a wide range of cross-lingual understanding tasks, including identifying offending content.
Second, Panoptic FPN: the latest technology for image and video understanding
People share billions of images on our platform, and understanding their content is critical to keeping people safe. Even simple pixel analysis may be enough for our systems to identify individual objects in a picture, but we can push industry-leading CV capabilities further and let systems understand the relationships between those objects in order to determine violations.
(Note: the "panoptic segmentation" task proposed by Kaiming He's team has recently become popular. In January of this year, they released the "Panoptic Feature Pyramid Networks" paper.)
Article link: https://arxiv.org/abs/1901.02446
Our systems are good at identifying objects in the foreground of a picture, such as a dog or a ball, but it is still difficult for them to understand backgrounds, which consist of larger, less distinct regions of pixels. Using the new Panoptic FPN object-recognition method, we can perform instance segmentation (for the foreground) and semantic segmentation (for the background) in a unified neural architecture.
Over the years, Facebook's CV systems have gradually come to identify more components of an image, and it is now possible to detect objects in both the foreground and the background with a single network. This yields a better understanding of a photo's overall context, as well as more computationally efficient image recognition.
Facebook's results show that, compared with running a separate network for each task, Panoptic FPN can roughly halve the overall computation required to perform both instance and semantic segmentation. In practice the system understands images better, which matters when determining whether content is a violation. The work also benefits other applications, such as automatically generating the descriptive text we use to describe images to the visually impaired.
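The unified output can be pictured as a merge step: instance masks for foreground "things" are overlaid on a semantic map of background "stuff," so every pixel ends up with exactly one label. The sketch below is a simplification of the paper's scheme (which resolves overlaps by confidence); the labels, grid, and merge rule are all invented for illustration.

```python
def merge_panoptic(semantic_map, instance_masks):
    """semantic_map: 2D grid of background ('stuff') labels.
    instance_masks: list of (label, set of (row, col)) per detected object.
    Returns one grid where each pixel carries a single panoptic label."""
    panoptic = [row[:] for row in semantic_map]  # start from the background
    for inst_id, (label, pixels) in enumerate(instance_masks):
        for r, c in pixels:
            # Foreground instances override background stuff; the #id
            # suffix keeps two dogs distinct, which plain semantic
            # segmentation cannot do.
            panoptic[r][c] = f"{label}#{inst_id}"
    return panoptic

semantic = [["sky", "sky", "sky"],
            ["grass", "grass", "grass"]]
instances = [("dog", {(1, 1)}), ("ball", {(1, 2)})]

print(merge_panoptic(semantic, instances))
# → [['sky', 'sky', 'sky'], ['grass', 'dog#0', 'ball#1']]
```

The point of Panoptic FPN is that both inputs to this merge come from one shared backbone rather than two separate networks, which is where the computational saving comes from.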
Finding violations in video is orders of magnitude harder than in images. Understanding video means considering the large number of frames that make up a given sequence and the motion that the sequence represents, while also handling non-visual inputs such as audio.
Because of these challenges, video understanding is still in its infancy. We continue to push the state of the art in both accuracy and efficiency, in part by focusing the system's attention and training on the most relevant data. For example, by decomposing 3D convolutions into 2D and 1D convolutions (handling space and time, respectively, in a given video sequence), we can reduce the number of trainable parameters, or keep the same number of parameters and improve accuracy. In short, this framework lets us find a balance between accuracy and efficiency.
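The parameter trade-off behind this decomposition is easy to verify by counting: a full 3D convolution with a t × k × k kernel is compared with a 2D (k × k) spatial convolution into m intermediate channels followed by a 1D (t) temporal one. The channel sizes below are illustrative, not taken from the talk.

```python
def params_3d(c_in, c_out, t, k):
    """Weights in one full t x k x k 3D convolution."""
    return c_in * c_out * t * k * k

def params_2plus1d(c_in, c_out, t, k, m):
    """Weights in the factored version: a k x k spatial conv into m
    intermediate channels, then a length-t temporal conv."""
    spatial = c_in * m * k * k   # 2D conv over space
    temporal = m * c_out * t     # 1D conv over time
    return spatial + temporal

c_in, c_out, t, k = 64, 64, 3, 3
full = params_3d(c_in, c_out, t, k)                  # 110,592 weights
factored = params_2plus1d(c_in, c_out, t, k, m=64)   # 49,152 weights

# Fewer parameters for the same receptive field; alternatively, m can be
# raised until the factored count matches the 3D one, spending the
# budget on extra nonlinearity (and typically accuracy) instead.
print(full, factored, factored < full)
```

This is the sense in which the same framework can either shrink the model or hold its size constant while improving accuracy.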
Rather than passing every frame of a given video to a spatiotemporal convolutional neural network, our saliency-sampling approach isolates the video segments that contain significant motion for further processing.
To understand what is happening in a video, we break it into short clips (each consisting of a small number of consecutive frames) and send each group of consecutive frames through our latest spatiotemporal model. We then aggregate this information to make predictions about the whole video.
In many videos, however, only a few clips carry salient information for a particular task, such as detecting bullying; the remaining clips are redundant or irrelevant. To further increase the speed and efficiency of finding actionable events in video, we therefore created a saliency sampler. The system is trained to focus on the parts that contain specific behaviors and then process those sets of frames in more detail. This more targeted analysis and training enables faster and more accurate video understanding.
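The sampling idea reduces to: score every short clip, keep only the top-scoring ones, and send just those on for the expensive spatiotemporal analysis. In this toy sketch the scoring function is a stand-in (summed "motion energy" per frame); in the real system it is a learned model.

```python
def split_into_clips(frames, clip_len):
    """Break a frame sequence into consecutive fixed-length clips."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]

def select_salient(clips, score_fn, top_k):
    """Keep only the top_k highest-scoring clips for further processing."""
    ranked = sorted(clips, key=score_fn, reverse=True)
    return ranked[:top_k]

# Pretend each frame is a single motion-energy number; clips with more
# motion are assumed more likely to contain the behavior of interest.
frames = [0, 0, 1, 9, 8, 7, 0, 1, 0, 6, 5, 4]
clips = split_into_clips(frames, clip_len=3)      # four 3-frame clips
salient = select_salient(clips, score_fn=sum, top_k=2)
print(salient)  # → [[9, 8, 7], [6, 5, 4]]
```

Only the selected clips reach the heavy model, which is where the speed and efficiency gains come from.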
Third, using hashtags for record accuracy in video understanding
We have also developed a different approach that sets new records for identifying behavior, including spotting content violations.
This technique builds directly on research we presented at last year's F8 conference (May 2018), which used billions of public images with hashtags to train networks that beat the most advanced technology on image-recognition tasks. In our new approach, hashtag data serves as weakly supervised data: the training examples are labeled, but not with full supervision.
The annotations obtained this way are noisy and imprecise compared with labels designed for training AI models. But the sheer number of labeled examples the approach provides means we can substantially improve video understanding through an unprecedented amount of weakly supervised training data.
In this case, the largest dataset we trained on contains more than 65 million public Instagram videos with hashtags. By contrast, current action-classification datasets contain only hundreds of thousands of videos. The technical challenges of using these videos are similar to those of our billions-of-images recognition work, for example distributed training across hardware, plus new ones, including handling hashtags that often apply to only a small portion of a video. For example, a video tagged #wedding and #dance might show the newlyweds dancing for only a few seconds of a long recording.
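Turning public hashtags into weak training targets can be sketched as a simple mapping. The tag vocabulary and video below are invented; a real pipeline would also canonicalize synonymous tags and drop rare ones, and, as noted above, the resulting labels are noisy because a tag may describe only seconds of the video.

```python
# Hypothetical label space for an action classifier.
LABEL_SPACE = ["wedding", "dance", "cooking", "soccer"]

def weak_label_vector(hashtags):
    """Map a video's public hashtags to a 0/1 target over the label space.
    Tags outside the vocabulary are simply ignored, one of several
    sources of noise in weak supervision."""
    tags = {t.lstrip("#").lower() for t in hashtags}
    return [1 if label in tags else 0 for label in LABEL_SPACE]

video_tags = ["#Wedding", "#dance", "#love"]  # '#love' is not in the space
print(weak_label_vector(video_tags))  # → [1, 1, 0, 0]
```

No annotator ever watches the video: the labels come for free with the public post, which is what makes the 65-million-video scale reachable.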
Despite this random noise, we found that the diversity of content and the sheer number of examples offset the label noise. Using our saliency sampler, the video-recognition model achieves state-of-the-art accuracy on three major video-classification benchmarks. This includes 82.8% accuracy on the Kinetics dataset when classifying a video into one of 400 human-behavior categories, 5.1 percentage points more accurate than other state-of-the-art systems, a reduction in error rate of more than 25%. We have applied this method in production systems, raising the bullying detection rate to nearly 85%.
Incorporating audio into this model yields even better results. Our experiments show that, compared with visual-only models using the same architecture and training process, our audio-and-video models set a new record on the AudioSet audio-event-detection benchmark, improving accuracy by 20% in detecting defamatory content and adult content.
Fourth, prospects for self-supervised methods in content understanding
Understanding language, images, and video is part of Facebook's ongoing effort. But as we focus on the long-term task of keeping the platform safe, it will become increasingly important to build systems that can be trained on large amounts of unlabeled data.
Most of our systems today rely on supervised training, which brings a range of challenges: a shortage of training data, and the long time it takes to collect and label examples when building a new classifier from scratch. Because new kinds of content violations spread quickly, and events such as elections have become flashpoints for harmful content, we have a responsibility to accelerate system development and improve responsiveness.
One possible answer is the self-supervised approach that Facebook's chief AI scientist Yann LeCun has been discussing for years: not relying solely on human-labeled training data, or even on weakly supervised data such as images and videos with public hashtags. Self-supervised methods can take advantage of completely unlabeled data, and they generalize well: a self-supervised system can use a small amount of labeled data to generalize to unseen tasks, possibly bringing us closer to AI with human-level intelligence.
This research strategy has recently begun to deliver powerful results for the Facebook AI team, with some self-supervised language-understanding models now outperforming traditional, fully supervised training.
In particular, we have developed models that learn to predict part of a given signal by training on the rest of the signal. For example, we train one self-supervised system to better understand language by masking words in sentences and having it predict them, even when the model has never seen the exact sentence before.
Given a short sentence like "A conversation about ________ and human connection," people can easily guess a few words that would fill the gap, but the task is much harder for AI. It is the basis of a useful and scalable training task, similar to the one solved by the BERT model that Google introduced around the same time. We can blank out each word of a sentence in turn and repeat the process over billions of words, none of which, of course, need to be labeled.
By separately analyzing the context to the left and right of the masked word, our bidirectional transformer model predicts the missing word without relying on labeled data.
To predict each hidden word, we use bidirectional transformer networks, which model the rest of the sentence by computing representations of the words before and after the mask (those to its right and left), then combine these representations to determine the central word. Once the system has been trained in this label-free way, we can fine-tune it with labeled data for specific tasks, such as identifying hate speech.
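The masking scheme described above can be sketched as a training-example generator: blank out each word of a sentence in turn, yielding (masked sentence, target word) pairs that require no human labels. The model itself is omitted here; this shows only how supervision is manufactured from raw text.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Yield one (masked sentence, target word) pair per word position."""
    words = sentence.split()
    for i, target in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        yield " ".join(masked), target

for masked, target in masked_examples("A conversation about privacy"):
    print(masked, "->", target)
# → [MASK] conversation about privacy -> A
#   A [MASK] about privacy -> conversation
#   ... and so on for every word in the sentence
```

Because every sentence of raw text produces as many training pairs as it has words, the process scales to billions of words with no annotation cost, which is exactly what makes the subsequent fine-tuning step so data-efficient.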
In internal tests, this mix of self-supervised and supervised training let us match the accuracy of a fully supervised model with less than one-tenth of the labeled data, or, with the same amount of training data, reduce error by 20% compared with the fully supervised model.
We also use self-supervision to improve speech recognition. We create several versions of an audio clip, altering some of the audio content, and the model must determine which version is correct using only the original audio as input, with no transcription or other labels.
For this approach, we stack two networks: an encoder network that maps the raw audio to a low-frequency feature representation, and a context network that predicts the correct audio. To make the task more effective, we have the context network predict further into the future, making the prediction problem harder.
After pretraining on raw, unlabeled audio data with the two convolutional neural networks, the system is optimized to solve an increasingly difficult task: predicting audio at increasing time offsets, with the arrow indicating predictions further into the future.
Once this pretrained, self-supervised model understands speech well, we use a small amount of supervised data, 80 hours of transcribed audio, to train the final speech-recognition system. Our system uses up to 150 times less labeled data than the best comparable system, Deep Speech 2, while reducing the word error rate by 9%. This work lets us quickly extend speech recognition to more languages without requiring large amounts of transcribed speech for each one.
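The self-supervised audio task above reduces to a contrastive choice: given the context up to time t, score the true future frame against an altered distractor and prefer the true one. In this toy sketch, frames are single numbers standing in for learned feature vectors, and the "prediction" is a simple linear extrapolation; the real system uses convolutional encoder and context networks trained end to end, with many distractors.

```python
def score(context, candidate):
    """Higher when the candidate continues the context smoothly.
    A linear extrapolation stands in for the learned context network."""
    predicted = context[-1] + (context[-1] - context[-2])
    return -abs(candidate - predicted)

audio = [0.1, 0.2, 0.3, 0.4, 0.5]    # a smoothly varying "signal"
context, true_future = audio[:4], audio[4]
distractor = 0.9                     # an altered version of the audio

# Training pushes the true continuation to outscore the altered one;
# no transcription or label is involved anywhere.
print(score(context, true_future) > score(context, distractor))  # → True
```

Solving this distinguishing task at ever-longer time offsets forces the representation to capture the structure of speech, which is why only 80 hours of transcripts suffice to train the final recognizer.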
Both methods focus on speech and language understanding, but they also represent a more fundamental shift in how we explore, and even combine, different levels of supervision. That includes using large amounts of unlabeled training data and unlocking the potential of self-supervised systems with small amounts of labeled data. Emphasizing self-supervision can accelerate all AI-related tasks, but none of them is more important than improving the safety of the people who use our products.