## SEER: Better, fairer computer vision through self-supervised training on uncurated internet images
_**SEER 2022**_: [[arXiv](https://arxiv.org/abs/2202.08360)], [[blogpost](https://ai.facebook.com/blog/seer-10b-better-fairer-computer-vision-through-self-supervised-learning-training-on-diverse-datasets)], _SEER 2021_: [[arXiv](https://arxiv.org/abs/2103.01988)], [[blogpost](https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision/)], [[Featured in the State of AI Report 2021](https://www.stateof.ai/)]
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/copy-detection-on-copydays-strong-subset)](https://paperswithcode.com/sota/copy-detection-on-copydays-strong-subset?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/fine-grained-image-classification-on-sun397)](https://paperswithcode.com/sota/fine-grained-image-classification-on-sun397?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/image-classification-on-food-101-1)](https://paperswithcode.com/sota/image-classification-on-food-101-1?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/image-classification-on-dtd)](https://paperswithcode.com/sota/image-classification-on-dtd?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/image-classification-on-clevr-count)](https://paperswithcode.com/sota/image-classification-on-clevr-count?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/image-classification-on-clevr-dist)](https://paperswithcode.com/sota/image-classification-on-clevr-dist?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/image-classification-on-places205)](https://paperswithcode.com/sota/image-classification-on-places205?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/meme-classification-on-hateful-memes)](https://paperswithcode.com/sota/meme-classification-on-hateful-memes?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/image-classification-on-inaturalist-2018)](https://paperswithcode.com/sota/image-classification-on-inaturalist-2018?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/image-classification-on-objectnet)](https://paperswithcode.com/sota/image-classification-on-objectnet?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/domain-generalization-on-imagenet-a)](https://paperswithcode.com/sota/domain-generalization-on-imagenet-a?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/domain-generalization-on-imagenet-r)](https://paperswithcode.com/sota/domain-generalization-on-imagenet-r?p=vision-models-are-more-robust-and-fair-when)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-models-are-more-robust-and-fair-when/domain-generalization-on-imagenet-sketch)](https://paperswithcode.com/sota/domain-generalization-on-imagenet-sketch?p=vision-models-are-more-robust-and-fair-when)
## About SEER
SEER is a family of **SE**lf-sup**ER**vised computer vision models trained on billions of uncurated images from the internet.
![seer overview](static/img/SEER_overview_4.jpg)
The largest model has 10 billion dense parameters and leads to [fairer, less harmful, and less biased models](#results-fairness). As model size increases, performance improves drastically on [fairness benchmarks](http://arxiv.org/abs/2202.07603) across gender, apparent skin tone, and age groups.
We also validate the performance on _more than 50 computer vision benchmarks_. Despite training on random collections of images on the internet with no data curation, the 10B model [outperformed SOTA supervised and self-supervised models trained on ImageNet on 70 percent of the benchmarks](#results-image-classification) while achieving competitive or equal performance on the rest.
Further, this model understands images from across the world well enough to [geolocalize them with unprecedented precision](#salient-property-geolocalization). It also discovers salient properties in the data, such as [multi-lingual hashtag embeddings of similar concepts](#salient-property-multilingual-hashtag-word-cloud) (e.g., weddings): even though the model is trained only on the images themselves, with no location information or other metadata, it groups together the same concepts expressed in multiple languages from all over the world.
On top of achieving strong performance on standard computer vision benchmarks, the model also excels at challenging tasks and is more [robust to out-of-domain generalization](#results-out-of-domain-generalization). For example, it correctly identifies animals in sketches and artistic renditions, and it handles difficult images involving camouflage, blur, occlusion, motion, and unusual perspectives.
## Pretrained Model Weights
All SEER models are pretrained on a _random_, _uncurated_ subset of 1 billion Instagram images.
| Model | Architecture | Parameters | Pretrained weights | IN1K finetuned weights | IN1K finetuned Top-1 (%) @ 384px | iNaturalist18 finetuned Top-1 (%) @ 384px | Places205 finetuned Top-1 (%) @ 384px | CopyDays copy detection mAP (%) |
| ----- | -------------- |:-----------:|:----------------:|:--------------------:|:-------------------:|:---------------------:|:------------------:|:---------------------------:|
| SEER | RG-32Gf | 156M | [model](https://dl.fbaipublicfiles.com/vissl/model_zoo/seer_regnet32d/seer_regnet32gf_model_iteration244000.torch) | [model](https://dl.fbaipublicfiles.com/vissl/model_zoo/seer_finetuned/seer_regnet32_finetuned_in1k_model_final_checkpoint_phase78.torch) | 83.4 | 79.1 | 67.5 | 83.0 |
| SEER | RG-64Gf | 250M | [model](https://dl.fbaipublicfiles.com/vissl/model_zoo/seer_regnet64/seer_regnet64gf_model_final_checkpoint_phase0.torch) | [model](https://dl.fbaipublicfiles.com/vissl/model_zoo/seer_finetuned/seer_regnet64_finetuned_in1k_model_final_checkpoint_phase78.torch) | 84.0 | 80.0 | 67.4 | 83.2 |
| SEER | RG-128Gf | 693M | [model](https://dl.fbaipublicfiles.com/vissl/model_zoo/swav_ig1b_regnet128Gf_cnstant_bs32_node16_sinkhorn10_proto16k_syncBN64_warmup8k/model_final_checkpoint_phase0.torch) | [model](https://dl.fbaipublicfiles.com/vissl/model_zoo/seer_finetuned/seer_regnet128_finetuned_in1k_model_final_checkpoint_phase78.torch) | 84.5 | 82.6 | 67.5 | 86.5 |
| SEER | RG-256Gf | 1.5B | [model](https://dl.fbaipublicfiles.com/vissl/model_zoo/swav_ig1b_cosine_rg256gf_noBNhead_wd1e5_fairstore_bs16_node64_sinkhorn10_proto16k_apex_syncBN64_warmup8k/model_final_checkpoint_phase0.torch) | [model](https://dl.fbaipublicfiles.com/vissl/model_zoo/seer_finetuned/seer_regnet256_finetuned_in1k_model_final_checkpoint_phase38.torch) | 85.2 | 83.9 | 67.7 | 87.8 |
| SEER | RG-10B | 10B | [model](https://dl.fbaipublicfiles.com/vissl/model_zoo/seer_regnet10B/model_iteration124500_conso.torch) | [model](https://dl.fbaipublicfiles.com/vissl/model_zoo/seer_finetuned/seer_10b_finetuned_in1k_model_phase28_conso.torch) | **85.8** | **84.7** | **69.0** | **90.6** |
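Each entry in the table links to a plain `.torch` checkpoint hosted on `dl.fbaipublicfiles.com`. As a minimal sketch (not an official SEER/VISSL API), the snippet below downloads the smallest pretrained checkpoint and prints its top-level keys; it assumes PyTorch ≥ 1.13 and that the file is an ordinary pickled dictionary, so the exact key layout may differ from what is shown here.

```python
import torch
from torch.hub import download_url_to_file

# Smallest SEER pretrained checkpoint (RG-32Gf, 156M parameters) from the table above.
url = (
    "https://dl.fbaipublicfiles.com/vissl/model_zoo/seer_regnet32d/"
    "seer_regnet32gf_model_iteration244000.torch"
)
local_path = "seer_regnet32gf_pretrained.torch"
download_url_to_file(url, local_path)

# Assumption: the checkpoint is an ordinary pickled dict saved during training and may
# contain non-tensor training state, so weights_only=False is needed on recent PyTorch
# releases. Only do this for checkpoints from a source you trust.
checkpoint = torch.load(local_path, map_location="cpu", weights_only=False)

# Inspect the top-level structure to locate the trunk weights before loading them
# into a RegNet trunk (e.g. through VISSL's model zoo utilities).
if isinstance(checkpoint, dict):
    for key, value in checkpoint.items():
        print(key, type(value).__name__)
else:
    print(type(checkpoint).__name__)
```

Note that the fine-tuned Top-1 numbers in the table are reported at 384px input resolution, so any evaluation pipeline should resize and crop images accordingly.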
## Model Documentation & License
We provide [model documentation](./MODEL_DOCUMENTATION.md) detailing how SEER was created and its intended uses. The use of SEER model weights is subject to the [Model License](./MODEL_LICENSE).
## Citation
```BibTeX
@article{goyal2022vision,
  title={Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision},
  author={Priya Goyal and Quentin Duval and Isaac Seessel and Mathilde Caron and Ishan Misra and Levent Sagun and Armand Joulin and Piotr Bojanowski},
  year={2022},
  eprint={2202.08360},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
## Results: Fairness
SEER is trained on public Instagram images without any data curation and leads to fairer, less harmful models.
- Compared to the conventional ImageNet dataset, the SEER training dataset is more diverse and naturally represents images from ~200 countries, as shown in the top-left figure.
- SEER models show better object recognition performance when evaluated on geographically diverse data. The performance improves the most for low- to mid-income and non-Western countries (bottom-left), reducing the disparity in object recognition across the world. We also show qualitative results below.
- SEER models reduce harmful label associations (top-right) and gender disparity (bottom-right) compared to supervised pretraining on IN1K, with larger models being fairer than smaller ones.
We illustrate qualitative geographical diversity results on the DollarStreet dataset with a fixed RegNet128Gf architecture, where SEER's performance is noticeably better than that of models pretrained on IN1K (whether with SwAV self-supervised pretraining or supervised pretraining).