AI METRICS DATA

########

Game score on a selected index of Atari 2600 games

[Month, Year: Games, Scores]. Compiled by Miles Brundage & Jack Clark.
Why we care: Reflects the improvement of reinforcement learning algorithms at mastering dynamic environments.

December, 2013:
Breakout - 225
Enduro - 661
Pong - 21
Q*Bert - 4500
Seaquest - 1740
S. Invaders - 1075
Paper: Playing Atari with Deep Reinforcement Learning
Url: https://arxiv.org/pdf/1312.5602.pdf

February, 2015:
Breakout - 401
Enduro - 301
Pong - 18.9
Q*Bert - 10596
Seaquest - 5286
S. Invaders - 1976
Paper: Human-Level Control Through Deep Reinforcement Learning
Url: https://www.semanticscholar.org/paper/Human-level-control-through-deep-reinforcement-Mnih-Kavukcuoglu/340f48901f72278f6bf78a04ee5b01df208cc508

September, 2015:
Breakout - 375
Enduro - 319
Pong - 21
Q*Bert - 14875
Seaquest - 7995
S. Invaders - 3154
Paper: Deep Reinforcement Learning with Double Q-Learning
Url: https://pdfs.semanticscholar.org/3b97/32bb07dc99bde5e1f9f75251c6ea5039373e.pdf?_ga=1.165640319.1334652001.1475539859

November, 2015:
Breakout - 345
Enduro - 2258
Pong - 21
Q*Bert - 19220
Seaquest - 50254
S. Invaders - 6427
Paper: Dueling Network Architectures for Deep Reinforcement Learning
Url: https://pdfs.semanticscholar.org/13b5/8f3108709dbbed5588759bc0496f82a261c4.pdf?_ga=1.123524811.1334652001.1475539859

June, 2016:
Breakout - 766.8
Enduro - -82.5
Pong - 10.7
Q*Bert - 21307
Seaquest - 1326.1
S. Invaders - 23846.0
Paper: Asynchronous Methods for Deep Reinforcement Learning
Url: https://arxiv.org/pdf/1602.01783.pdf

#########

Word error rate on Switchboard

[Month, Year: Score [SWB]: Team]. Compiled by Jack Clark.
A note about measurement: We're measuring Switchboard (SWB) and CallHome (CH) performance from the Hub5'00 dataset, with the main scores assessed in terms of word error rate on SWB.
Why do we care: Reflects the improvement of audio processing systems on speech over time.
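For reference, the scores in this section are word error rates (WER): the number of word-level substitutions, insertions, and deletions needed to turn the system's transcript into the reference transcript, divided by the number of reference words. Below is a minimal sketch of the metric; the toy sentences are made up, and official Hub5'00 scoring additionally involves NIST scoring tools and text normalization that this sketch ignores.

```python
# Minimal, illustrative word error rate (WER): word-level edit distance
# (substitutions + insertions + deletions) divided by reference length.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: one deleted word out of six -> WER of about 16.7%
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```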
2011:
16.1% SWB
Who: Microsoft
Technique: CD-DNN (Context-Dependent Deep Neural Network)
Paper: Conversational Speech Transcription Using Context-Dependent Deep Neural Networks
Url: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/CD-DNN-HMM-SWB-Interspeech2011-Pub.pdf

2012:
April, 2012: 18.5% SWB
Who: University of Toronto, Google, IBM, Microsoft
Paper: Deep Neural Networks for Acoustic Modeling in Speech Recognition
Url: https://pdfs.semanticscholar.org/ce25/00257fda92338ec0a117bea1dbc0381d7c73.pdf?_ga=1.195375081.452266805.1483390947

2013:
12.9% SWB (24.5% CH, 18.7%)
Who: Brno University of Technology, University of Edinburgh, Johns Hopkins
Technique: DNN BMMI (deep neural network, boosted maximum mutual information)
Paper: Sequence-discriminative training of deep neural networks
Url: http://www.danielpovey.com/files/2013_interspeech_dnn.pdf

2014:
June, 2014: 16.0% SWB (23.7% CH, 19.9% EV)
Who: Stanford
Paper: Increasing Deep Neural Network Acoustic Model Size for Large Vocabulary Continuous Speech Recognition
Url: https://arxiv.org/abs/1406.7806v1

December, 2014: 20.0% SWB (Deep Speech, SWB); 12.6% SWB (Deep Speech, SWB + FSH)
Deep Speech SWB: 20.0% SWB, 31.8% CH, 25.9% blended.
Deep Speech SWB + FSH: 12.6% SWB, 19.3% CH, 16.0% blended.
Who: Baidu
Paper: Deep Speech: Scaling up end-to-end speech recognition
Url: https://arxiv.org/abs/1412.5567

2015:
May, 2015: 8.0% SWB (14.1% CH, 11.0% blended)
Who: IBM
Paper: The IBM 2015 English Conversational Telephone Speech Recognition System
Url: https://arxiv.org/abs/1505.05899

2016:
June, 2016: 6.9% SWB
n-gram + model M + NNLM: 6.9% SWB, 12.5% CH, 9.7% blended.
Who: IBM
Paper: The IBM 2016 English Conversational Telephone Speech Recognition System
Url: https://arxiv.org/abs/1604.08242v1

September, 2016: 6.2% SWB
Single model (ResNet): 6.9% SWB, 13.2% CH, 10.05% blended.
Combination: 6.2% SWB, 12.0% CH, 9.1% blended.
Who: Microsoft
Paper: The Microsoft 2016 Conversational Speech Recognition System
Url: https://arxiv.org/abs/1609.03528

October, 2016: 5.9% SWB
Single model (ResNet): 6.6% SWB, 12.5% CH, 9.55% blended.
Combination: 5.9% SWB, 11.1% CH, 8.5% blended.
Who: Microsoft
Paper: Achieving Human Parity in Conversational Speech Recognition
Url: https://arxiv.org/abs/1610.05256

2017:
March, 2017: 5.5% SWB (10.3% CH)
Who: IBM
Paper: English Conversational Telephone Speech Recognition by Humans and Machines
Url: https://arxiv.org/abs/1703.02136

#########

Image Classification on ImageNet

[Year, Classification Error, Team]. Compiled by Jack Clark.
Why we care: This dataset provides a good measure of image classification accuracy over time, letting us understand the accuracy with which computers - given sufficient data - can see the world.

2010: 0.28191 - NEC UIUC - http://image-net.org/challenges/LSVRC/2010/results
2011: 0.25770 - XRCE
2012: 0.16422 - SuperVision - http://image-net.org/challenges/LSVRC/2012/results.html
2013: 0.11743 - Clarifai - http://www.image-net.org/challenges/LSVRC/2013/results.php
2014: 0.07405 - VGG - http://image-net.org/challenges/LSVRC/2014/index
2015: 0.03567 - MSRA - http://image-net.org/challenges/LSVRC/2015/results
2016: 0.02991 - Trimps-Soushen - http://image-net.org/challenges/LSVRC/2016/results

#########

Generative models of CIFAR-10 Natural Images

[Year: bits-per-subpixel, method]. Compiled by Durk Kingma.
Why we care: (1) The compression = prediction = understanding = intelligence view (see the Hutter Prize, etc.); note that perplexity, log-likelihood, and #bits are all equivalent measurements. (2) Learning a generative model is a prominent auxiliary task towards semi-supervised learning; current SOTA semi-supervised classification results utilize generative models. (3) You're finding the patterns in the data that let you compress it more efficiently - the ultimate pattern-recognition benchmark, because you're trying to find the patterns in all of the data.

2014: 4.48
Method: NICE
Paper: NICE: Non-linear Independent Components Estimation
Url: https://arxiv.org/abs/1410.8516

2015: 4.13
Method: DRAW
Paper: DRAW: A Recurrent Neural Network for Image Generation
Url: https://arxiv.org/abs/1502.04623

2016: 3.49
Method: Real NVP
Paper: Density Estimation Using Real NVP
Url: https://arxiv.org/abs/1605.08803

2016: 3.11
Method: VAE with IAF
Paper: Improving Variational Inference with Inverse Autoregressive Flow
Url: https://arxiv.org/abs/1606.04934

2016: 3.0
Method: PixelRNN
Paper: Pixel Recurrent Neural Networks
Url: https://arxiv.org/abs/1601.06759

2016: 2.92
Method: PixelCNN++
Paper: PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications
Url: https://arxiv.org/abs/1701.05517

#########

Perplexity on Penn Treebank

[Year, perplexity score, people involved]. Compiled by Jack Clark.
Why do we care: Perplexity gives us a sense of how well the computer has been able to model language. Lower perplexity indicates less confusion on the part of the model about what language to use to complete a sentence.
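Concretely, perplexity is the exponential of the model's average per-word negative log-likelihood on the test set, which is the same quantity as 2 raised to the cross-entropy in bits per word (the perplexity / log-likelihood / #bits equivalence noted in the CIFAR-10 section above). A minimal sketch follows, using made-up per-word probabilities rather than a real language model:

```python
import math

# Minimal sketch: perplexity = exp(average negative log-likelihood per word),
# equivalently 2 ** (cross-entropy in bits per word).
def perplexity(word_probs):
    avg_nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_nll)

# Hypothetical p(word | context) values assigned by some model:
probs = [0.1, 0.05, 0.2, 0.01, 0.3]
ppl = perplexity(probs)
print(round(ppl, 1), "perplexity =", round(math.log2(ppl), 2), "bits per word")
```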
2012: 124.7 - Mikolov & Zweig - https://pdfs.semanticscholar.org/04e0/fefb859f4b02b017818915a2645427bfbdb2.pdf
2013: 107.5 (test) - Pascanu et al., How to Construct Deep Recurrent Neural Networks - https://arxiv.org/abs/1312.6026
2014: 78.4 - Zaremba et al. - https://arxiv.org/abs/1409.2329
2015: 73.4 - Variational LSTM (large, untied, MC) - https://arxiv.org/pdf/1512.05287v5.pdf
2016: 70.9 - Pointer Sentinel-LSTM - https://arxiv.org/pdf/1609.07843v1.pdf
2016: 66 - Recurrent Highway Networks - https://arxiv.org/pdf/1607.03474v3.pdf

#########

**_Bits-per-character on the enwik8 dataset, to measure Hutter Prize compression progress_**

[Year, bits-per-character]. Compiled by Jack Clark.
Why we care about this: the (proposed) relationship between compression and intelligence.

**2011:**
2011: 1.60
Paper: Generating Text with Recurrent Neural Networks
Url: http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf

**2013:**
2013: 1.67
Paper: Generating Sequences with Recurrent Neural Networks
Url: https://arxiv.org/abs/1308.0850

**2015:**
February, 2015: 1.58
Paper: Gated Feedback Recurrent Neural Networks
Url: https://arxiv.org/abs/1502.02367

July, 2015: 1.47
Paper: Grid Long Short-Term Memory
Url: https://arxiv.org/abs/1507.01526

**2016:**
July, 2016: 1.32
Paper: Recurrent Highway Networks
Url: https://arxiv.org/abs/1607.03474

September, 2016: 1.32
Paper: Hierarchical Multiscale Recurrent Neural Networks
Url: https://arxiv.org/abs/1609.01704

September, 2016: 1.39
Paper: HyperNetworks (HyperLSTM)
Url: https://arxiv.org/abs/1609.09106

October, 2016: 1.37
Paper: Surprisal-Driven Feedback in Recurrent Neural Networks
Url: https://arxiv.org/pdf/1608.06027.pdf

2016: 1.313 (test)
Paper: Surprisal-Driven Zoneout
Url: https://pdfs.semanticscholar.org/e9bc/83f9ff502bec9cffb750468f76fdfcf5dd05.pdf?_ga=1.27297145.452266805.1483390947

**Human/Algo best performance:**
**cmix v11: 1.245 BPC (test)**
cmix (the current SOTA) uses neural nets, both LSTM and fully connected, and combines 1,746 independent models, the majority of which come from other open-source compression programs, e.g. paq8l, paq8pxd, paq8hp12.
Url: http://www.byronknoll.com/cmix.html
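A quick way to ground these numbers: for a compressor, bits-per-character is simply output size in bits divided by the number of input characters (enwik8 is the first 10^8 bytes of an English Wikipedia dump, so one byte per character), while for a language model it is the average negative log2-probability assigned to each observed character. A minimal sketch under those assumptions:

```python
import math

# Bits-per-character (BPC) two ways, assuming enwik8's 10**8 single-byte characters.

def bpc_from_compressed_size(compressed_bytes, original_chars=10**8):
    # Compressor view: output bits / input characters.
    return compressed_bytes * 8 / original_chars

def bpc_from_char_probs(char_probs):
    # Language-model view: average cross-entropy, in bits, of its
    # next-character predictions (the probabilities would come from the model).
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

# E.g. a 1.245 BPC result corresponds to squeezing the ~100 MB file into roughly:
print(1.245 * 10**8 / 8 / 1e6, "MB")
```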
* * *

**_Task completion and average error score on Facebook's bAbI dataset, to measure limited question answering. Compiled by Jack Clark._**

[Year, failed tasks #, mean error %, paper]
[Note: a task is defined as failed if its error is higher than 5%. Mean error is the average error across all tasks - see the worked sketch at the end of this section.]
**Why we care about this:** helps us understand how well AI systems can be trained to deduce facts from a large corpus.
(Measurement note: the pathfinding and positional reasoning questions are much harder for AI than the other question types, so watching this performance improve is significant.)

February, 2015:
Failed tasks: 4
Mean error: 8%
Paper: Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
Url: https://arxiv.org/abs/1502.05698

March, 2015:
Failed tasks: 4
Mean error: 7.2%
(Comparisons: MemNN WSH - weakly supervised, heuristic - 39.2 / 17; strongly supervised MemNN - 3.2 / 2.)
Paper: End-To-End Memory Networks
Url: https://arxiv.org/abs/1503.08895

June, 2015:
Failed tasks: 2
Mean error: 6.4%
Paper: Ask Me Anything: Dynamic Memory Networks for Natural Language Processing (v1)
Url: https://arxiv.org/abs/1506.07285

January, 2016:
Failed tasks: 2
Mean error: 4.3%
Paper: Hybrid Computing Using a Neural Network with Dynamic External Memory
Url: http://www.nature.com/nature/journal/v538/n7626/full/nature20101.html#tables

June, 2016:
Failed tasks: 1
Mean error: 2.81%
Paper: Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes
Url: https://arxiv.org/pdf/1607.00036.pdf

October, 2016:
Paper: Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes
Url: https://arxiv.org/pdf/1610.09027v1.pdf

November, 2016:
Failed tasks: 2
Mean error: 3.7%
Paper: Gated End-to-End Memory Networks
Url: https://arxiv.org/pdf/1610.04211.pdf

December, 2016:
Failed tasks: 0
Mean error: 0.5%
Paper: Tracking the World State with Recurrent Entity Networks
Url: https://arxiv.org/abs/1612.03969

February, 2017:
Failed tasks: 0
Mean error: 0.3%
Paper: Query-Reduction Networks for Question Answering (v5)
Url: https://arxiv.org/abs/1606.04582
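As referenced in the note at the top of this section, the two headline numbers are derived from the per-task test error rates: a task counts as failed if its error exceeds 5%, and the mean error is the unweighted average across tasks. A minimal sketch with made-up per-task errors (bAbI has 20 tasks):

```python
# Minimal sketch of the two bAbI summary statistics, computed from
# per-task test error rates in percent. The error values are made up.
def summarize_babi(task_errors, fail_threshold=5.0):
    failed = sum(1 for e in task_errors if e > fail_threshold)   # "failed" = error > 5%
    mean_error = sum(task_errors) / len(task_errors)             # unweighted mean over tasks
    return failed, mean_error

errors = [0.0, 0.3, 1.1, 0.0, 0.5, 6.2, 0.0, 0.2, 0.1, 0.0,
          0.0, 0.0, 0.4, 8.9, 0.0, 0.1, 2.3, 0.6, 1.8, 0.2]      # hypothetical 20-task run
failed, mean_err = summarize_babi(errors)
print(f"Failed tasks: {failed}, Mean error: {mean_err:.2f}%")
```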