Training Deep Neural Networks
Tutorials
Popular Training Approaches of DNNs — A Quick Overview
Optimisation and training techniques for deep learning
https://blog.acolyer.org/2017/03/01/optimisation-and-training-techniques-for-deep-learning/
Activation functions
ReLU
Rectified Linear Units Improve Restricted Boltzmann Machines
- intro: ReLU
- paper: http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf
Expressiveness of Rectifier Networks
- intro: ICML 2016
- intro: This paper studies the expressiveness of ReLU Networks
- arxiv: https://arxiv.org/abs/1511.05678
How can a deep neural network with ReLU activations in its hidden layers approximate any function?
Understanding Deep Neural Networks with Rectified Linear Units
- intro: Johns Hopkins University
- arxiv: https://arxiv.org/abs/1611.01491
Learning ReLUs via Gradient Descent
https://arxiv.org/abs/1705.04591
Training Better CNNs Requires to Rethink ReLU
https://arxiv.org/abs/1709.06247
Deep Learning using Rectified Linear Units (ReLU)
- intro: Adamson University
- arxiv: https://arxiv.org/abs/1803.08375
- github: https://github.com/AFAgarap/relu-classifier
LReLU
Rectifier Nonlinearities Improve Neural Network Acoustic Models
- intro: leaky-ReLU, aka LReLU
- paper: http://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf
Deep Sparse Rectifier Neural Networks
PReLU
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
- keywords: PReLU, Caffe “msra” weight initialization
- arxiv: http://arxiv.org/abs/1502.01852
Empirical Evaluation of Rectified Activations in Convolutional Network
- intro: ReLU / LReLU / PReLU / RReLU
- arxiv: http://arxiv.org/abs/1505.00853
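The four rectifier variants compared above differ only in how they treat negative inputs. A minimal NumPy sketch (slope values are illustrative defaults, not tuned):

```python
import numpy as np

def relu(x):
    # ReLU: zero out negative inputs
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):
    # LReLU: small fixed slope on the negative side
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # PReLU: the negative slope `alpha` is a learned parameter (typically per channel)
    return np.where(x > 0, x, alpha * x)

def rrelu(x, lower=1/8., upper=1/3., training=True):
    # RReLU: negative slope sampled uniformly at train time,
    # fixed to the mean of the range at test time
    if training:
        alpha = np.random.uniform(lower, upper, size=x.shape)
    else:
        alpha = (lower + upper) / 2.0
    return np.where(x > 0, x, alpha * x)
```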
SReLU
Deep Learning with S-shaped Rectified Linear Activation Units
- intro: SReLU
- arxiv: http://arxiv.org/abs/1512.07030
Parametric Activation Pools greatly increase performance and consistency in ConvNets
From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification
- intro: ICML 2016
- arxiv: http://arxiv.org/abs/1602.02068
- github: https://github.com/gokceneraslan/SparseMax.torch
- github: https://github.com/Unbabel/sparsemax
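Sparsemax is a Euclidean projection of the logits onto the probability simplex, so unlike softmax it can return exact zeros. A small NumPy sketch of the closed-form projection from the paper, for a single logit vector:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax of a 1-D array of logits (Martins & Astudillo, 2016)."""
    z_sorted = np.sort(z)[::-1]                  # logits in decreasing order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = k[1 + k * z_sorted > cumsum]       # largest k satisfying the support condition
    k_z = support[-1]
    tau = (cumsum[k_z - 1] - 1.0) / k_z          # threshold
    return np.maximum(z - tau, 0.0)              # sparse probability vector

p = sparsemax(np.array([2.0, 1.0, 0.1, -1.0]))
print(p, p.sum())   # some entries are exactly 0; the rest sum to 1
```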
Revise Saturated Activation Functions
Noisy Activation Functions
MBA
Multi-Bias Non-linear Activation in Deep Neural Networks
- intro: MBA
- arxiv: https://arxiv.org/abs/1604.00676
Learning activation functions from data using cubic spline interpolation
- arxiv: http://arxiv.org/abs/1605.05509
- bitbucket: https://bitbucket.org/ispamm/spline-nn
What is the role of the activation function in a neural network?
Concatenated ReLU (CReLU)
Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units
- intro: ICML 2016
- arxiv: http://arxiv.org/abs/1603.05201
Implement CReLU (Concatenated ReLU)
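CReLU concatenates the rectified positive and negative phases of each activation, doubling the channel dimension. A minimal NumPy sketch:

```python
import numpy as np

def crelu(x, axis=-1):
    # Concatenated ReLU: keep both ReLU(x) and ReLU(-x),
    # so the output has twice as many channels as the input.
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=axis)

x = np.array([[1.5, -2.0, 0.3]])
print(crelu(x))   # [[1.5 0.  0.3 0.  2.  0. ]]
```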
GELU
Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units
Formulating The ReLU
Activation Ensembles for Deep Neural Networks
https://arxiv.org/abs/1702.07790
SELU
Self-Normalizing Neural Networks
- intro: SELU
- arxiv: https://arxiv.org/abs/1706.02515
- github: https://github.com/bioinf-jku/SNNs
- notes: https://github.com/kevinzakka/research-paper-notes/blob/master/snn.md
- github(Chainer): https://github.com/musyoku/self-normalizing-networks
SELUs (scaled exponential linear units) - Visualized and Histogramed Comparisons among ReLU and Leaky ReLU
https://github.com/shaohua0116/Activation-Visualization-Histogram
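SELU is a scaled ELU whose two constants are chosen so that, together with the appropriate ("lecun normal") weight initialization, activations are pushed toward zero mean and unit variance. A sketch using the approximate constants from the paper:

```python
import numpy as np

# Constants derived in the Self-Normalizing Neural Networks paper (approximate)
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    # scale * x                 for x > 0
    # scale * alpha * (e^x - 1) for x <= 0
    return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))
```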
Difference Between Softmax Function and Sigmoid Function
http://dataaspirant.com/2017/03/07/difference-between-softmax-function-and-sigmoid-function/
Flexible Rectified Linear Units for Improving Convolutional Neural Networks
- keywords: flexible rectified linear unit (FReLU)
- arxiv: https://arxiv.org/abs/1706.08098
Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting
- intro: CMU
- arxiv: https://arxiv.org/abs/1707.04199
EraseReLU
EraseReLU: A Simple Way to Ease the Training of Deep Convolution Neural Networks
https://arxiv.org/abs/1709.07634
Swish
Swish: a Self-Gated Activation Function
Searching for Activation Functions
- intro: Google Brain
- arxiv: https://arxiv.org/abs/1710.05941
- reddit: https://www.reddit.com/r/MachineLearning/comments/77gcrv/d_swish_is_not_performing_very_well/
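Swish is simply x · sigmoid(βx); with β = 1 it coincides with the sigmoid-weighted linear unit (SiLU). A one-liner sketch:

```python
import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); beta can be fixed or treated as a learned parameter
    return x / (1.0 + np.exp(-beta * x))
```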
Deep Learning with Data Dependent Implicit Activation Function
https://arxiv.org/abs/1802.00168
Series on Initialization of Weights for DNN
Initialization Of Feedforward Networks
Initialization Of Deep Feedforward Networks
Initialization Of Deep Networks: Case of Rectifiers
Weights Initialization
An Explanation of Xavier Initialization
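As a quick reference, Xavier/Glorot initialization sets the weight variance from the fan-in and fan-out, while He ("msra") initialization uses 2/fan_in to compensate for ReLU zeroing half of the activations. A NumPy sketch of both:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    # Glorot & Bengio (2010): Var(W) = 2 / (fan_in + fan_out)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # He et al. (2015), a.k.a. Caffe "msra": Var(W) = 2 / fan_in, for ReLU/PReLU layers
    std = np.sqrt(2.0 / fan_in)
    return np.random.normal(0.0, std, size=(fan_in, fan_out))
```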
Random Walk Initialization for Training Very Deep Feedforward Networks
Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?
All you need is a good init
- intro: ICLR 2016
- intro: Layer-sequential unit-variance (LSUV) initialization
- arxiv: http://arxiv.org/abs/1511.06422
- github(Caffe): https://github.com/ducha-aiki/LSUVinit
- github(Torch): https://github.com/yobibyte/torch-lsuv
- github: https://github.com/yobibyte/yobiblog/blob/master/posts/all-you-need-is-a-good-init.md
- github(Keras): https://github.com/ducha-aiki/LSUV-keras
- review: http://www.erogol.com/need-good-init/
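LSUV first gives each layer approximately orthonormal weights, then feeds a batch through the network and rescales each layer's weights until its outputs have unit variance. A rough NumPy sketch for a stack of ReLU linear layers (the tolerance and iteration cap are illustrative):

```python
import numpy as np

def orthonormal(shape):
    # SVD-based orthonormal initialization
    a = np.random.normal(0.0, 1.0, shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == shape else vt

def lsuv_init(weights, x, tol=0.1, max_iter=10):
    """weights: list of (fan_in, fan_out) matrices, updated in place.
    x: a data batch used to estimate each layer's output variance."""
    h = x
    for i, w in enumerate(weights):
        weights[i] = orthonormal(w.shape)
        for _ in range(max_iter):
            var = (h @ weights[i]).var()
            if abs(var - 1.0) < tol:
                break
            weights[i] /= np.sqrt(var)       # rescale toward unit output variance
        h = np.maximum(h @ weights[i], 0.0)  # propagate through ReLU to the next layer
    return weights
```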
All You Need is Beyond a Good Init: Exploring Better Solution for Training Extremely Deep Convolutional Neural Networks with Orthonormality and Modulation
- intro: CVPR 2017. HIKVision
- arxiv: https://arxiv.org/abs/1703.01827
Data-dependent Initializations of Convolutional Neural Networks
What are good initial weights in a neural network?
- stackexchange: http://stats.stackexchange.com/questions/47590/what-are-good-initial-weights-in-a-neural-network
RandomOut: Using a convolutional gradient norm to win The Filter Lottery
Categorical Reparameterization with Gumbel-Softmax
- intro: Google Brain & University of Cambridge & Stanford University
- arxiv: https://arxiv.org/abs/1611.01144
- github: https://github.com/ericjang/gumbel-softmax
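The Gumbel-Softmax (Concrete) trick gives a differentiable, temperature-controlled approximation of a categorical sample by adding Gumbel noise to the logits and applying softmax. A NumPy sketch of the sampling step:

```python
import numpy as np

def sample_gumbel(shape, eps=1e-20):
    # Gumbel(0, 1) noise via inverse transform sampling
    u = np.random.uniform(0.0, 1.0, shape)
    return -np.log(-np.log(u + eps) + eps)

def gumbel_softmax_sample(logits, temperature=1.0):
    # Lower temperature -> closer to a one-hot sample; higher -> closer to uniform.
    y = (logits + sample_gumbel(logits.shape)) / temperature
    y = y - y.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)
```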
On weight initialization in deep neural networks
Batch Normalization
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- intro: ImageNet top-5 error: 4.82%
- keywords: internal covariate shift problem
- arxiv: http://arxiv.org/abs/1502.03167
- blog: https://standardfrancis.wordpress.com/2015/04/16/batch-normalization/
- notes: http://blog.csdn.net/happynear/article/details/44238541
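For reference, the batch-norm training-time forward pass normalizes each feature over the mini-batch, applies a learned scale and shift, and maintains running statistics for inference. A NumPy sketch for fully connected activations of shape (N, D); the momentum convention is illustrative:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      momentum=0.9, eps=1e-5, training=True):
    """x: (N, D); gamma, beta, running_mean, running_var: (D,), updated in place."""
    if training:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # update running statistics for use at test time
        running_mean[:] = momentum * running_mean + (1 - momentum) * mu
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize per feature
    return gamma * x_hat + beta             # learned scale and shift
```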
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
- arxiv: http://arxiv.org/abs/1602.07868
- github(Lasagne): https://github.com/TimSalimans/weight_norm
- github: https://github.com/openai/weightnorm
- notes: http://www.erogol.com/my-notes-weight-normalization/
Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks
Revisiting Batch Normalization For Practical Domain Adaptation
- intro: Pattern Recognition
- keywords: Adaptive Batch Normalization (AdaBN)
- arxiv: https://arxiv.org/abs/1603.04779
Implementing Batch Normalization in Tensorflow
Exploring Normalization in Deep Residual Networks with Concatenated Rectified Linear Units
- intro: Oculus VR & Facebook & NEC Labs America
- paper: https://research.fb.com/publications/exploring-normalization-in-deep-residual-networks-with-concatenated-rectified-linear-units/
Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models
- intro: Sergey Ioffe, Google
- arxiv: https://arxiv.org/abs/1702.03275
Comparison of Batch Normalization and Weight Normalization Algorithms for the Large-scale Image Classification
https://arxiv.org/abs/1709.08145
In-Place Activated BatchNorm for Memory-Optimized Training of DNNs
- intro: Mapillary Research
- arxiv: https://arxiv.org/abs/1712.02616
- github: https://github.com/mapillary/inplace_abn
Batch Kalman Normalization: Towards Training Deep Neural Networks with Micro-Batches
https://arxiv.org/abs/1802.03133
Decorrelated Batch Normalization
- intro: CVPR 2018
- arxiv: https://arxiv.org/abs/1804.08450
- github: https://github.com/umich-vl/DecorrelatedBN
Backward pass of BN
Understanding the backward pass through Batch Normalization Layer
Deriving the Gradient for the Backward Pass of Batch Normalization
https://kevinzakka.github.io/2016/09/14/batch_normalization/
What does the gradient flowing through batch normalization looks like ?
http://cthorey.github.io./backpropagation/
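The two posts above derive the batch-norm backward pass by pushing the upstream gradient through the normalize-scale-shift graph; the compact result is reproduced below as a NumPy sketch (x has shape (N, D); mu and var are the batch statistics cached from the forward pass):

```python
import numpy as np

def batchnorm_backward(dout, x, gamma, mu, var, eps=1e-5):
    N = x.shape[0]
    x_hat = (x - mu) / np.sqrt(var + eps)

    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)

    dx_hat = dout * gamma
    dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * (var + eps) ** -1.5
    dmu = np.sum(-dx_hat / np.sqrt(var + eps), axis=0) \
          + dvar * np.mean(-2.0 * (x - mu), axis=0)
    dx = dx_hat / np.sqrt(var + eps) \
         + dvar * 2.0 * (x - mu) / N \
         + dmu / N
    return dx, dgamma, dbeta
```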
Layer Normalization
Layer Normalization
- arxiv: https://arxiv.org/abs/1607.06450
- github: https://github.com/ryankiros/layer-norm
- github(TensorFlow): https://github.com/pbhatia243/tf-layer-norm
- github: https://github.com/MycChiu/fast-LayerNorm-TF
Keras GRU with Layer Normalization
Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks
Group Normalization
Group Normalization
- intro: Facebook AI Research (FAIR)
- arxiv: https://arxiv.org/abs/1803.08494
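Group Norm splits the channels into groups and normalizes within each group per sample, so it does not depend on the batch size; with a single group it behaves like Layer Norm over channels, and with one group per channel like Instance Norm. A NumPy sketch for NCHW tensors, close to the pseudocode in the paper:

```python
import numpy as np

def group_norm(x, gamma, beta, groups=32, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: (1, C, 1, 1)."""
    N, C, H, W = x.shape
    x = x.reshape(N, groups, C // groups, H, W)
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)   # normalize within each group
    x = x.reshape(N, C, H, W)
    return gamma * x + beta
```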
Loss Function
The Loss Surfaces of Multilayer Networks
Direct Loss Minimization for Training Deep Neural Nets
Nonconvex Loss Functions for Classifiers and Deep Networks
Learning Deep Embeddings with Histogram Loss
Large-Margin Softmax Loss for Convolutional Neural Networks
- intro: ICML 2016
- intro: Peking University & South China University of Technology & CMU & Shenzhen University
- arxiv: https://arxiv.org/abs/1612.02295
- github(Official. Caffe): https://github.com/wy1iu/LargeMargin_Softmax_Loss
- github: https://github.com/luoyetx/mx-lsoftmax
- github: https://github.com/tpys/face-recognition-caffe2
- github: https://github.com/jihunchoi/lsoftmax-pytorch
An empirical analysis of the optimization of deep network loss surfaces
https://arxiv.org/abs/1612.04010
Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes
- intro: Peking University
- arxiv: https://arxiv.org/abs/1706.10239
Hierarchical Softmax
http://building-babylon.net/2017/08/01/hierarchical-softmax/
Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation
- intro: CVPR 2017
- arxiv: https://arxiv.org/abs/1708.03769
DropMax: Adaptive Stochastic Softmax
- intro: UNIST & Postech & KAIST
- arxiv: https://arxiv.org/abs/1712.07834
Rethinking Feature Distribution for Loss Functions in Image Classification
- intro: CVPR 2018 spotlight
- arxiv: https://arxiv.org/abs/1803.02988
Ensemble Soft-Margin Softmax Loss for Image Classification
- intro: IJCAI 2018
- arxiv: https://arxiv.org/abs/1805.03922
Learning Rate
No More Pesky Learning Rates
- intro: Tom Schaul, Sixin Zhang, Yann LeCun
- arxiv: https://arxiv.org/abs/1206.1106
Coupling Adaptive Batch Sizes with Learning Rates
- intro: Max Planck Institute for Intelligent Systems
- intro: Tensorflow implementation of SGD with Coupled Adaptive Batch Size (CABS)
- arxiv: https://arxiv.org/abs/1612.05086
- github: https://github.com/ProbabilisticNumerics/cabs
Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates
https://arxiv.org/abs/1708.07120
Improving the way we work with learning rate.
https://medium.com/@bushaev/improving-the-way-we-work-with-learning-rate-5e99554f163b
WNGrad: Learn the Learning Rate in Gradient Descent
- intro: University of Texas at Austin & Facebook AI Research
- arxiv: https://arxiv.org/abs/1803.02865
Convolution Filters
Non-linear Convolution Filters for CNN-based Learning
- intro: ICCV 2017
- arxiv: https://arxiv.org/abs/1708.07038
Pooling
Stochastic Pooling for Regularization of Deep Convolutional Neural Networks
- intro: ICLR 2013. Matthew D. Zeiler, Rob Fergus
- paper: http://www.matthewzeiler.com/pubs/iclr2013/iclr2013.pdf
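Stochastic pooling replaces max/average pooling with sampling: within each pooling region, one activation is picked with probability proportional to its (non-negative) value at training time, and the probability-weighted average is used at test time. A sketch for a single pooling region:

```python
import numpy as np

def stochastic_pool_region(a, training=True):
    """a: non-negative activations of one pooling region (e.g. post-ReLU)."""
    a = a.ravel()
    s = a.sum()
    if s == 0:
        return 0.0
    p = a / s                                 # multinomial over the region
    if training:
        idx = np.random.choice(len(a), p=p)   # sample one activation
        return a[idx]
    return np.sum(p * a)                      # test time: weighted average
```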
Multi-scale Orderless Pooling of Deep Convolutional Activation Features
- intro: ECCV 2014
- intro: MOP-CNN, orderless VLAD pooling, image classification / instance-level retrieval
- arxiv: https://arxiv.org/abs/1403.1840
- paper: http://web.engr.illinois.edu/~slazebni/publications/yunchao_eccv14_mopcnn.pdf
Fractional Max-Pooling
- arxiv: https://arxiv.org/abs/1412.6071
- notes: https://gist.github.com/shagunsodhani/ccfe3134f46fd3738aa0
- github: https://github.com/torch/nn/issues/371
TI-POOLING: transformation-invariant pooling for feature learning in Convolutional Neural Networks
- intro: CVPR 2016
- paper: http://dlaptev.org/papers/Laptev16_CVPR.pdf
- github: https://github.com/dlaptev/TI-pooling
S3Pool: Pooling with Stochastic Spatial Sampling
- arxiv: https://arxiv.org/abs/1611.05138
- github(Lasagne): https://github.com/Shuangfei/s3pool
Inductive Bias of Deep Convolutional Networks through Pooling Geometry
Improved Bilinear Pooling with CNNs
https://arxiv.org/abs/1707.06772
Learning Bag-of-Features Pooling for Deep Convolutional Neural Networks
- intro: ICCV 2017
- arxiv: https://arxiv.org/abs/1707.08105
- github: https://github.com/passalis/cbof
A new kind of pooling layer for faster and sharper convergence
- blog: https://medium.com/@singlasahil14/a-new-kind-of-pooling-layer-for-faster-and-sharper-convergence-1043c756a221
- github: https://github.com/singlasahil14/sortpool2d
Statistically Motivated Second Order Pooling
https://arxiv.org/abs/1801.07492
Detail-Preserving Pooling in Deep Networks
- intro: CVPR 2018
- arxiv: https://arxiv.org/abs/1804.04076
Mini-Batch
Online Batch Selection for Faster Training of Neural Networks
- intro: Workshop paper at ICLR 2016
- arxiv: https://arxiv.org/abs/1511.06343
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
- intro: ICLR 2017
- arxiv: https://arxiv.org/abs/1609.04836
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
- intro: Facebook
- keywords: Training with 256 GPUs, minibatches of 8192
- arxiv: https://arxiv.org/abs/1706.02677
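The key recipe in that paper is the linear scaling rule (multiply the base learning rate by k when the minibatch is k times larger) combined with a gradual warmup over the first few epochs. A sketch of the schedule, using the reference batch size of 256 from the paper:

```python
def scaled_lr(base_lr, batch_size, epoch, iteration, iters_per_epoch,
              ref_batch_size=256, warmup_epochs=5):
    # Linear scaling rule: target lr grows proportionally with the minibatch size.
    target_lr = base_lr * batch_size / ref_batch_size
    progress = epoch + iteration / iters_per_epoch
    if progress < warmup_epochs:
        # Gradual warmup: ramp linearly from base_lr up to target_lr.
        return base_lr + (target_lr - base_lr) * progress / warmup_epochs
    return target_lr
```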
Scaling SGD Batch Size to 32K for ImageNet Training
https://arxiv.org/abs/1708.03888
ImageNet Training in 24 Minutes
https://arxiv.org/abs/1709.05011
Don’t Decay the Learning Rate, Increase the Batch Size
- intro: Google Brain
- arxiv: https://arxiv.org/abs/1711.00489
Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes
- intro: NIPS 2017 Workshop: Deep Learning at Supercomputer Scale
- arxiv: https://arxiv.org/abs/1711.04325
AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks
- intro: UC Berkeley & NVIDIA
- arxiv: https://arxiv.org/abs/1712.02029
Hessian-based Analysis of Large Batch Training and Robustness to Adversaries
- intro: UC Berkeley & University of Texas
- arxiv: https://arxiv.org/abs/1802.08241
Revisiting Small Batch Training for Deep Neural Networks
https://arxiv.org/abs/1804.07612
Optimization Methods
On Optimization Methods for Deep Learning
Invariant backpropagation: how to train a transformation-invariant neural network
A practical theory for designing very deep convolutional neural network
- kaggle: https://www.kaggle.com/c/datasciencebowl/forums/t/13166/happy-lantern-festival-report-and-code/69284
- paper: https://kaggle2.blob.core.windows.net/forum-message-attachments/69182/2287/A%20practical%20theory%20for%20designing%20very%20deep%20convolutional%20neural%20networks.pdf?sv=2012-02-12&se=2015-12-05T15%3A40%3A02Z&sr=b&sp=r&sig=kfBQKduA1pDtu837Y9Iqyrp2VYItTV0HCgOeOok9E3E%3D
- slides: http://vdisk.weibo.com/s/3nFsznjLKn
Stochastic Optimization Techniques
- intro: SGD/Momentum/NAG/Adagrad/RMSProp/Adadelta/Adam/ESGD/Adasecant/vSGD/Rprop
- blog: http://colinraffel.com/wiki/stochastic_optimization_techniques
Alec Radford’s animations for optimization algorithms
http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Faster Asynchronous SGD (FASGD)
An overview of gradient descent optimization algorithms (★★★★★)
- arxiv: https://arxiv.org/abs/1609.04747
- blog: http://sebastianruder.com/optimizing-gradient-descent/
Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters
Writing fast asynchronous SGD/AdaGrad with RcppParallel
Quick Explanations Of Optimization Methods
Learning to learn by gradient descent by gradient descent
- intro: Google DeepMind
- arxiv: https://arxiv.org/abs/1606.04474
- github: https://github.com/deepmind/learning-to-learn
- github(TensorFlow): https://github.com/runopti/Learning-To-Learn
- github(PyTorch): https://github.com/ikostrikov/pytorch-meta-optimizer
SGDR: Stochastic Gradient Descent with Restarts
- arxiv: http://arxiv.org/abs/1608.03983
- github: https://github.com/loshchil/SGDR
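SGDR anneals the learning rate with a cosine schedule within each run and then restarts it, optionally lengthening each run by a factor T_mult. A sketch of the schedule (default values are illustrative):

```python
import math

def sgdr_lr(epoch, eta_max=0.1, eta_min=0.0, T_0=10, T_mult=2):
    """Cosine annealing with warm restarts (Loshchilov & Hutter)."""
    T_i, t = T_0, epoch
    while t >= T_i:          # locate the current restart cycle
        t -= T_i
        T_i *= T_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_i))
```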
The zen of gradient descent
Big Batch SGD: Automated Inference using Adaptive Batch Sizes
Improving Stochastic Gradient Descent with Feedback
- arxiv: https://arxiv.org/abs/1611.01505
- github: https://github.com/jayanthkoushik/sgd-feedback
- github: https://github.com/tdeboissiere/DeepLearningImplementations/tree/master/Eve
Learning Gradient Descent: Better Generalization and Longer Horizons
- intro: Tsinghua University
- arxiv: https://arxiv.org/abs/1703.03633
- github(TensorFlow): https://github.com/vfleaking/rnnprop
Optimization Algorithms
- blog: https://3dbabove.com/2017/11/14/optimizationalgorithms/
- github: https://github.com/ManuelGonzalezRivero/3dbabove
- reddit: https://www.reddit.com/r/MachineLearning/comments/7ehxky/d_optimization_algorithms_math_and_code/
Gradient Normalization & Depth Based Decay For Deep Learning
- intro: Columbia University
- arxiv: https://arxiv.org/abs/1712.03607
Neumann Optimizer: A Practical Optimization Algorithm for Deep Neural Networks
- intro: Google Research
- arxiv: https://arxiv.org/abs/1712.03298
Optimization for Deep Learning Highlights in 2017
http://ruder.io/deep-learning-optimization-2017/index.html
Gradients explode - Deep Networks are shallow - ResNet explained
- intro: CMU & UC Berkeley
- arxiv: https://arxiv.org/abs/1712.05577
Adam
Adam: A Method for Stochastic Optimization
- intro: ICLR 2015
- arxiv: http://arxiv.org/abs/1412.6980
Fixing Weight Decay Regularization in Adam
- intro: University of Freiburg
- arxiv: https://arxiv.org/abs/1711.05101
- github: https://github.com/loshchil/AdamW-and-SGDW
- github: https://github.com/fastai/fastai/pull/46/files
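For reference, the Adam update keeps exponential moving averages of the gradient and its square with bias correction; the AdamW fix above applies weight decay directly to the weights instead of folding an L2 term into the gradient. A NumPy sketch of one update step covering both variants:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One Adam/AdamW step; w, g, m, v are arrays of the same shape, t >= 1."""
    if weight_decay and not decoupled:
        g = g + weight_decay * w                 # classic L2 folded into the gradient
    m[:] = beta1 * m + (1 - beta1) * g           # first-moment estimate
    v[:] = beta2 * v + (1 - beta2) * g * g       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        w -= lr * weight_decay * w               # AdamW: decoupled weight decay
    return w
```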
Tensor Methods
Tensorizing Neural Networks
- intro: TensorNet
- arxiv: http://arxiv.org/abs/1509.06569
- github(Matlab+Theano+Lasagne): https://github.com/Bihaqo/TensorNet
- github(TensorFlow): https://github.com/timgaripov/TensorNet-TF
Tensor methods for training neural networks
- homepage: http://newport.eecs.uci.edu/anandkumar/#home
- youtube: https://www.youtube.com/watch?v=B4YvhcGaafw
- slides: http://newport.eecs.uci.edu/anandkumar/slides/Strata-NY.pdf
- talks: http://newport.eecs.uci.edu/anandkumar/#talks
Regularization
DisturbLabel: Regularizing CNN on the Loss Layer
- intro: University of California & MSR 2016
- intro: “an extremely simple algorithm which randomly replaces a part of labels as incorrect values in each iteration”
- paper: http://research.microsoft.com/en-us/um/people/jingdw/pubs/cvpr16-disturblabel.pdf
Robust Convolutional Neural Networks under Adversarial Noise
- intro: ICLR 2016
- arxiv: http://arxiv.org/abs/1511.06306
Adding Gradient Noise Improves Learning for Very Deep Networks
- intro: ICLR 2016
- arxiv: http://arxiv.org/abs/1511.06807
Stochastic Function Norm Regularization of Deep Networks
- arxiv: http://arxiv.org/abs/1605.09085
- github: https://github.com/AmalRT/DNN_Reg
SoftTarget Regularization: An Effective Technique to Reduce Over-Fitting in Neural Networks
Regularizing neural networks by penalizing confident predictions
- intro: Gabriel Pereyra, George Tucker, Lukasz Kaiser, Geoffrey Hinton. Google Brain
- dropbox: https://www.dropbox.com/s/8kqf4v2c9lbnvar/BayLearn%202016%20(gjt).pdf?dl=0
- mirror: https://pan.baidu.com/s/1kUUtxdl
Automatic Node Selection for Deep Neural Networks using Group Lasso Regularization
Regularization in deep learning
- blog: https://medium.com/@cristina_scheau/regularization-in-deep-learning-f649a45d6e0#.py327hkuv
- github: https://github.com/cscheau/Examples/blob/master/iris_l1_l2.py
LDMNet: Low Dimensional Manifold Regularized Neural Networks
https://arxiv.org/abs/1711.06246
Learning Sparse Neural Networks through L0 Regularization
- intro: University of Amsterdam & OpenAI
- arxiv: https://arxiv.org/abs/1712.01312
Regularization and Optimization strategies in Deep Convolutional Neural Network
https://arxiv.org/abs/1712.04711
Regularizing Deep Networks by Modeling and Predicting Label Structure
- intro: CVPR 2018
- arxiv: https://arxiv.org/abs/1804.02009
Dropout
Improving neural networks by preventing co-adaptation of feature detectors
- intro: Dropout
- arxiv: http://arxiv.org/abs/1207.0580
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
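The standard "inverted dropout" implementation drops units with probability p at training time and rescales the survivors by 1/(1-p), so inference requires no change. A NumPy sketch:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    # Inverted dropout: scale at train time so test time is a no-op.
    if not training or p == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)
    return x * mask
```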
Fast dropout training
- paper: http://jmlr.org/proceedings/papers/v28/wang13a.pdf
- github: https://github.com/sidaw/fastdropout
Dropout as data augmentation
- paper: http://arxiv.org/abs/1506.08700
- notes: https://www.evernote.com/shard/s189/sh/ef0c3302-21a4-40d7-b8b4-1c65b8ebb1c9/24ff553fcfb70a27d61ff003df75b5a9
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
Improved Dropout for Shallow and Deep Learning
Dropout Regularization in Deep Learning Models With Keras
Dropout with Expectation-linear Regularization
Dropout with Theano
- blog: http://rishy.github.io/ml/2016/10/12/dropout-with-theano/
- ipn: http://nbviewer.jupyter.org/github/rishy/rishy.github.io/blob/master/ipy_notebooks/Dropout-Theano.ipynb
Information Dropout: learning optimal representations through noise
Recent Developments in Dropout
Generalized Dropout
Analysis of Dropout
Variational Dropout Sparsifies Deep Neural Networks
Learning Deep Networks from Noisy Labels with Dropout Regularization
- intro: 2016 IEEE 16th International Conference on Data Mining
- arxiv: https://arxiv.org/abs/1705.03419
Concrete Dropout
- intro: University of Cambridge
- arxiv: https://arxiv.org/abs/1705.07832
- github: https://github.com/yaringal/ConcreteDropout
Analysis of dropout learning regarded as ensemble learning
- intro: Nihon University
- arxiv: https://arxiv.org/abs/1706.06859
An Analysis of Dropout for Matrix Factorization
https://arxiv.org/abs/1710.03487
Analysis of Dropout in Online Learning
https://arxiv.org/abs/1711.03343
Regularization of Deep Neural Networks with Spectral Dropout
https://arxiv.org/abs/1711.08591
Data Dropout in Arbitrary Basis for Deep Network Regularization
https://arxiv.org/abs/1712.00891
DropConnect
Regularization of Neural Networks using DropConnect
- homepage: http://cs.nyu.edu/~wanli/dropc/
- gitxiv: http://gitxiv.com/posts/rJucpiQiDhQ7HkZoX/regularization-of-neural-networks-using-dropconnect
- github: https://github.com/iassael/torch-dropconnect
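DropConnect drops individual weights rather than whole activations, so each forward pass uses a randomly thinned weight matrix. A minimal sketch for one fully connected layer; note that the paper proposes a Gaussian moment-matching approximation at inference, whereas the sketch uses the simpler rescaling-in-expectation shortcut:

```python
import numpy as np

def dropconnect_forward(x, W, b, p=0.5, training=True):
    """x: (N, D_in), W: (D_in, D_out). Drops connections, not units."""
    if training:
        mask = np.random.rand(*W.shape) >= p       # keep each weight with prob 1-p
        return x @ (W * mask) / (1.0 - p) + b
    return x @ W + b
```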
Regularizing neural networks with dropout and with DropConnect
DropNeuron
DropNeuron: Simplifying the Structure of Deep Neural Networks
Maxout
Maxout Networks
- intro: ICML 2013
- intro: “its output is the max of a set of inputs, a natural companion to dropout”
- project page: http://www-etud.iro.umontreal.ca/~goodfeli/maxout.html
- arxiv: https://arxiv.org/abs/1302.4389
- github: https://github.com/lisa-lab/pylearn2/blob/master/pylearn2/models/maxout.py
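A maxout unit computes k affine projections of the input and takes their element-wise maximum, so the nonlinearity itself is learned (and, as the paper argues, pairs naturally with dropout). A NumPy sketch of one maxout layer:

```python
import numpy as np

def maxout_forward(x, W, b):
    """x: (N, D_in); W: (k, D_in, D_out); b: (k, D_out).
    Each output unit is the max over k linear pieces."""
    z = np.einsum('nd,kdo->nko', x, W) + b   # (N, k, D_out)
    return z.max(axis=1)                     # (N, D_out)
```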
Improving Deep Neural Networks with Probabilistic Maxout Units
Swapout
Swapout: Learning an ensemble of deep architectures
- arxiv: https://arxiv.org/abs/1605.06465
- blog: https://gab41.lab41.org/lab41-reading-group-swapout-learning-an-ensemble-of-deep-architectures-e67d2b822f8a#.9r2s4c58n
Whiteout
Whiteout: Gaussian Adaptive Regularization Noise in Deep Neural Networks
- intro: University of Notre Dame & University of Science and Technology of China
- arxiv: https://arxiv.org/abs/1612.01490
ShakeDrop regularization
https://arxiv.org/abs/1802.02375
Gradient Descent
RMSProp: Divide the gradient by a running average of its recent magnitude
- intro: never published as a paper; introduced in a lecture slide from Geoffrey Hinton’s Coursera class
- slides: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
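RMSProp divides each gradient coordinate by a running root-mean-square of its recent magnitudes. A NumPy sketch of the update from the lecture slide:

```python
import numpy as np

def rmsprop_step(w, g, cache, lr=1e-3, decay=0.9, eps=1e-8):
    # cache is a running average of the squared gradient, updated in place
    cache[:] = decay * cache + (1 - decay) * g * g
    w -= lr * g / (np.sqrt(cache) + eps)
    return w
```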
Fitting a model via closed-form equations vs. Gradient Descent vs. Stochastic Gradient Descent vs. Mini-Batch Learning. What is the difference? (Normal Equations vs. GD vs. SGD vs. MB-GD)
http://sebastianraschka.com/faq/docs/closed-form-vs-gd.html
An Introduction to Gradient Descent in Python
Train faster, generalize better: Stability of stochastic gradient descent
A Variational Analysis of Stochastic Gradient Algorithms
The vanishing gradient problem: Oh no — an obstacle to deep learning!
Gradient Descent For Machine Learning
Revisiting Distributed Synchronous SGD
Convergence rate of gradient descent
A Robust Adaptive Stochastic Gradient Method for Deep Learning
- intro: IJCNN 2017. An extension of “ADASECANT: Robust Adaptive Secant Method for Stochastic Gradient”
- intro: Universite de Montreal & University of Oxford
- arxiv: https://arxiv.org/abs/1703.00788
Accelerating Stochastic Gradient Descent
https://arxiv.org/abs/1704.08227
Gentle Introduction to the Adam Optimization Algorithm for Deep Learning
Understanding Generalization and Stochastic Gradient Descent
A Bayesian Perspective on Generalization and Stochastic Gradient Descent
- intro: Google Brain
- arxiv: https://arxiv.org/abs/1710.06451
Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent
- intro: UC Berkeley & Microsoft Research, India
- arxiv: https://arxiv.org/abs/1711.10456
Improving Generalization Performance by Switching from Adam to SGD
https://arxiv.org/abs/1712.07628
AdaGrad
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
ADADELTA: An Adaptive Learning Rate Method
Momentum
On the importance of initialization and momentum in deep learning
- intro: NAG (Nesterov Accelerated Gradient)
- paper: http://www.cs.toronto.edu/~fritz/absps/momentum.pdf
- paper: http://jmlr.org/proceedings/papers/v28/sutskever13.pdf
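Classical momentum accumulates a velocity from past gradients; Nesterov's variant evaluates the gradient at the look-ahead point. A NumPy sketch of one step of both (`grad_fn` is a hypothetical function returning the gradient at the given weights):

```python
import numpy as np

def momentum_step(w, v, grad_fn, lr=0.01, mu=0.9, nesterov=False):
    """w, v: arrays of the same shape; v is the velocity, updated in place."""
    if nesterov:
        g = grad_fn(w + mu * v)   # NAG: gradient at the look-ahead position
    else:
        g = grad_fn(w)            # classical momentum: gradient at the current position
    v[:] = mu * v - lr * g
    w += v
    return w, v
```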
YellowFin and the Art of Momentum Tuning
- intro: Stanford University
- intro: auto-tuning momentum SGD optimizer
- project page: http://cs.stanford.edu/~zjian/project/YellowFin/
- arxiv: https://arxiv.org/abs/1706.03471
- github(TensorFlow): https://github.com/JianGoForIt/YellowFin
- github(PyTorch): https://github.com/JianGoForIt/YellowFin_Pytorch
Backpropagation
Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks
- intro: ECCV 2016. First place in the ILSVRC 2015 Scene Classification challenge
- arxiv: https://arxiv.org/abs/1512.05830
- paper: http://www.cis.pku.edu.cn/faculty/vision/zlin/Publications/2016-ECCV-RelayBP.pdf
Top-down Neural Attention by Excitation Backprop
- intro: ECCV 2016 (oral)
- project page: http://cs-people.bu.edu/jmzhang/excitationbp.html
- arxiv: http://arxiv.org/abs/1608.00507
- paper: http://cs-people.bu.edu/jmzhang/EB/ExcitationBackprop.pdf
- github: https://github.com/jimmie33/Caffe-ExcitationBP
Towards a Biologically Plausible Backprop
- arxiv: http://arxiv.org/abs/1602.05179
- github: https://github.com/bscellier/Towards-a-Biologically-Plausible-Backprop
Sampled Backpropagation: Training Deep and Wide Neural Networks on Large Scale, User Generated Content Using Label Sampling
The Reversible Residual Network: Backpropagation Without Storing Activations
- intro: CoRR 2017. University of Toronto
- arxiv: https://arxiv.org/abs/1707.04585
- github: https://github.com/renmengye/revnet-public
meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting
- intro: ICML 2017
- arxiv: https://arxiv.org/abs/1706.06197
- github: https://github.com/jklj077/meProp
Accelerate Training
Neural Networks with Few Multiplications
- intro: ICLR 2016
- arxiv: https://arxiv.org/abs/1510.03009
Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices
Deep Q-Networks for Accelerating the Training of Deep Neural Networks
Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs
Parallelism
One weird trick for parallelizing convolutional neural networks
- author: Alex Krizhevsky
- arxiv: http://arxiv.org/abs/1404.5997
8-Bit Approximations for Parallelism in Deep Learning (ICLR 2016)
Handling Datasets
Data Augmentation
DataAugmentation ver1.0: Image data augmentation tool for training of image recognition algorithm
Caffe-Data-Augmentation: a Caffe branch with data augmentation support, using a configurable stochastic combination of 7 data augmentation techniques
Image Augmentation for Deep Learning With Keras
What you need to know about data augmentation for machine learning
- intro: Keras ImageDataGenerator
- blog: https://cartesianfaith.com/2016/10/06/what-you-need-to-know-about-data-augmentation-for-machine-learning/
HZPROC: torch data augmentation toolbox (supports affine transform)
AGA: Attribute Guided Augmentation
- intro: one-shot recognition
- arxiv: https://arxiv.org/abs/1612.02559
Accelerating Deep Learning with Multiprocess Image Augmentation in Keras
- blog: http://blog.stratospark.com/multiprocess-image-augmentation-keras.html
- github: https://github.com/stratospark/keras-multiprocess-image-data-generator
Comprehensive Data Augmentation and Sampling for Pytorch
Image augmentation for machine learning experiments.
https://github.com/aleju/imgaug
Google/inception’s data augmentation: scale and aspect ratio augmentation
https://github.com/facebook/fb.resnet.torch/blob/master/datasets/transforms.lua#L130
Caffe Augmentation Extension
- intro: Data Augmentation for Caffe
- github: https://github.com/twtygqyy/caffe-augmentation
Improving Deep Learning using Generic Data Augmentation
- intro: University of Cape Town
- arxiv: https://arxiv.org/abs/1708.06020
- github: https://github.com/webstorms/AugmentedDatasets
Augmentor: An Image Augmentation Library for Machine Learning
Learning to Compose Domain-Specific Transformations for Data Augmentation
https://arxiv.org/abs/1709.01643
Data Augmentation in Classification using GAN
https://arxiv.org/abs/1711.00648
Data Augmentation Generative Adversarial Networks
https://arxiv.org/abs/1711.04340
Random Erasing Data Augmentation
Context Augmentation for Convolutional Neural Networks
https://arxiv.org/abs/1712.01653
The Effectiveness of Data Augmentation in Image Classification using Deep Learning
https://arxiv.org/abs/1712.04621
MentorNet: Regularizing Very Deep Neural Networks on Corrupted Labels
- intro: Google Inc & Stanford University
- arxiv: https://arxiv.org/abs/1712.05055
mixup: Beyond Empirical Risk Minimization
- intro: MIT & FAIR
- arxiv: https://arxiv.org/abs/1710.09412
- github: https://github.com/leehomyc/mixup_pytorch
- github: https://github.com/unsky/mixup
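mixup trains on convex combinations of pairs of examples and of their (one-hot) labels, with the mixing weight drawn from a Beta(α, α) distribution. A NumPy sketch of constructing one mixed batch:

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2):
    """x: (N, ...) inputs; y: (N, num_classes) one-hot labels."""
    lam = np.random.beta(alpha, alpha)
    idx = np.random.permutation(len(x))      # pair each example with a random partner
    x_mixed = lam * x + (1 - lam) * x[idx]
    y_mixed = lam * y + (1 - lam) * y[idx]
    return x_mixed, y_mixed
```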
mixup: Data-Dependent Data Augmentation
http://www.inference.vc/mixup-data-dependent-data-augmentation/
Data Augmentation by Pairing Samples for Images Classification
- intro: IBM Research - Tokyo
- arxiv: https://arxiv.org/abs/1801.02929
Feature Space Transfer for Data Augmentation
- keywords: FeATure TransfEr Network (FATTEN)
- arxiv: https://arxiv.org/abs/1801.04356
Visual Data Augmentation through Learning
https://arxiv.org/abs/1801.06665
BAGAN: Data Augmentation with Balancing GAN
https://arxiv.org/abs/1803.09655
Parallel Grid Pooling for Data Augmentation
- intro: The University of Tokyo & NTT Communications Science Laboratories
- arxiv: https://arxiv.org/abs/1803.11370
- github(Chainer): https://github.com/akitotakeki/pgp-chainer
Imbalanced Datasets
Investigation on handling Structured & Imbalanced Datasets with Deep Learning
- intro: smote resampling, cost sensitive learning
- blog: https://www.analyticsvidhya.com/blog/2016/10/investigation-on-handling-structured-imbalanced-datasets-with-deep-learning/
A systematic study of the class imbalance problem in convolutional neural networks
- intro: Duke University & Royal Institute of Technology (KTH)
- arxiv: https://arxiv.org/abs/1710.05381
Class Rectification Hard Mining for Imbalanced Deep Learning
https://arxiv.org/abs/1712.03162
Bridging the Gap: Simultaneous Fine Tuning for Data Re-Balancing
Noisy / Unlabelled Data
Data Distillation: Towards Omni-Supervised Learning
- intro: Facebook AI Research (FAIR)
- arxiv: https://arxiv.org/abs/1712.04440
Learning From Noisy Singly-labeled Data
- intro: University of Illinois Urbana Champaign & CMU & Caltech & Amazon AI
- arxiv: https://arxiv.org/abs/1712.04577
Low Numerical Precision
Training deep neural networks with low precision multiplications
- intro: ICLR 2015
- intro: Maxout networks, 10-bit activations, 12-bit parameter updates
- arxiv: http://arxiv.org/abs/1412.7024
- github: https://github.com/MatthieuCourbariaux/deep-learning-multipliers
Deep Learning with Limited Numerical Precision
- intro: ICML 2015
- arxiv: http://arxiv.org/abs/1502.02551
BinaryConnect: Training Deep Neural Networks with binary weights during propagations
- arxiv: https://arxiv.org/abs/1511.00363
- github: https://github.com/MatthieuCourbariaux/BinaryConnect
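BinaryConnect keeps real-valued weights for the parameter update but binarizes them to ±1 for the forward and backward passes, clipping the real weights to [-1, 1] after each step. A sketch of one deterministic-binarization training step for a linear layer (`dout`, the loss gradient w.r.t. the layer output, is assumed given; the paper also has a stochastic binarization variant):

```python
import numpy as np

def binaryconnect_step(W_real, x, dout, lr=0.01):
    """W_real: (D_in, D_out) full-precision weights; x: (N, D_in); dout: (N, D_out)."""
    W_b = np.sign(W_real)                     # binarize to {-1, +1} for propagation
    # forward would use y = x @ W_b; only the gradient step is shown here
    dW = x.T @ dout                           # gradient computed through the binary weights
    W_real -= lr * dW                         # update the real-valued weights
    np.clip(W_real, -1.0, 1.0, out=W_real)    # keep real weights in [-1, 1]
    return W_real, W_b
```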
Binarized Neural Networks
BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
- arxiv: http://arxiv.org/abs/1602.02830
- github: https://github.com/MatthieuCourbariaux/BinaryNet
- github: https://github.com/codekansas/tinier-nn
Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
Distributed Training
Large Scale Distributed Systems for Training Neural Networks
- intro: By Jeff Dean & Oriol Vinyals, Google. NIPS 2015.
- slides: https://media.nips.cc/Conferences/2015/tutorialslides/Jeff-Oriol-NIPS-Tutorial-2015.pdf
- video: http://research.microsoft.com/apps/video/default.aspx?id=259564&l=i
- mirror: http://pan.baidu.com/s/1mgXV0hU
Large Scale Distributed Deep Networks
- intro: distributed CPU training, data parallelism, model parallelism
- paper: http://www.cs.toronto.edu/~ranzato/publications/DistBeliefNIPS2012_withAppendix.pdf
- slides: http://admis.fudan.edu.cn/~yfhuang/files/LSDDN_slide.pdf
Implementation of a Practical Distributed Calculation System with Browsers and JavaScript, and Application to Distributed Deep Learning
- project page: http://mil-tokyo.github.io/
- arxiv: https://arxiv.org/abs/1503.05743
SparkNet: Training Deep Networks in Spark
- arxiv: http://arxiv.org/abs/1511.06051
- github: https://github.com/amplab/SparkNet
- blog: http://www.kdnuggets.com/2015/12/spark-deep-learning-training-with-sparknet.html
A Scalable Implementation of Deep Learning on Spark
- intro: Alexander Ulanov
- slides: http://www.slideshare.net/AlexanderUlanov1/a-scalable-implementation-of-deep-learning-on-spark-alexander-ulanov
- mirror: http://pan.baidu.com/s/1jHiNW5C
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
- arxiv: http://arxiv.org/abs/1603.04467
- gitxiv: http://gitxiv.com/posts/57kjddp3AWt4y5K4h/tensorflow-large-scale-machine-learning-on-heterogeneous
Distributed Supervised Learning using Neural Networks
- intro: Ph.D. thesis
- arxiv: http://arxiv.org/abs/1607.06364
Distributed Training of Deep Neuronal Networks: Theoretical and Practical Limits of Parallel Scalability
How to scale distributed deep learning?
- intro: Extended version of paper accepted at ML Sys 2016 (at NIPS 2016)
- arxiv: https://arxiv.org/abs/1611.04581
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
- intro: ICLR 2018. Tsinghua University & Stanford University
- intro: we find 99.9% of the gradient exchange in distributed SGD is redundant; we reduce the communication bandwidth by two orders of magnitude without losing accuracy
- keywords: momentum correction, local gradient clipping, momentum factor masking, and warm-up training
- arxiv: https://arxiv.org/abs/1712.01887
Distributed learning of CNNs on heterogeneous CPU/GPU architectures
https://arxiv.org/abs/1712.02546
Integrated Model and Data Parallelism in Training Neural Networks
- intro: UC Berkeley & Lawrence Berkeley National Laboratory
- arxiv: https://arxiv.org/abs/1712.04432
Projects
Theano-MPI: a Theano-based Distributed Training Framework
CaffeOnSpark: Open Sourced for Distributed Deep Learning on Big Data Clusters
- intro: Yahoo Big ML Team
- blog: http://yahoohadoop.tumblr.com/post/139916563586/caffeonspark-open-sourced-for-distributed-deep
- github: https://github.com/yahoo/CaffeOnSpark
- youtube: https://www.youtube.com/watch?v=bqj7nML-aHk
Tunnel: Data Driven Framework for Distributed Computing in Torch 7
Distributed deep learning with Keras and Apache Spark
- project page: http://joerihermans.com/work/distributed-keras/
- github: https://github.com/JoeriHermans/dist-keras
BigDL: Distributed Deep learning Library for Apache Spark
Videos
A Scalable Implementation of Deep Learning on Spark
Distributed TensorFlow on Spark: Scaling Google’s Deep Learning Library (Spark Summit)
Deep Recurrent Neural Networks for Sequence Learning in Spark (Spark Summit)
Distributed deep learning on Spark
- author: Alexander Ulanov. July 12, 2016
- intro: Alexander Ulanov offers an overview of tools and frameworks that have been proposed for performing deep learning on Spark.
- video: https://www.oreilly.com/learning/distributed-deep-learning-on-spark
Blogs
Distributed Deep Learning Reads
https://github.com/tmulc18/DistributedDeepLearningReads
Hadoop, Spark, Deep Learning Mesh on Single GPU Cluster
http://www.nextplatform.com/2016/02/24/hadoop-spark-deep-learning-mesh-on-single-gpu-cluster/
The Unreasonable Effectiveness of Deep Learning on Spark
https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html
Distributed Deep Learning with Caffe Using a MapR Cluster
https://www.mapr.com/blog/distributed-deep-learning-caffe-using-mapr-cluster
Deep Learning with Apache Spark and TensorFlow
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
Deeplearning4j on Spark
http://deeplearning4j.org/spark
Distributed Deep Learning, Part 1: An Introduction to Distributed Training of Neural Networks
GPU Acceleration in Databricks: Speeding Up Deep Learning on Apache Spark
https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html
Distributed Deep Learning with Apache Spark and Keras
Adversarial Training
Learning from Simulated and Unsupervised Images through Adversarial Training
- intro: CVPR 2017 oral, best paper award. Apple Inc.
- arxiv: https://arxiv.org/abs/1612.07828
The Robust Manifold Defense: Adversarial Training using Generative Models
https://arxiv.org/abs/1712.09196
DeepDefense: Training Deep Neural Networks with Improved Robustness
https://arxiv.org/abs/1803.00404
Low-Precision Training
High-Accuracy Low-Precision Training
- intro: Cornell University & Stanford University
- arxiv: https://arxiv.org/abs/1803.03383
Incremental Training
ClickBAIT: Click-based Accelerated Incremental Training of Convolutional Neural Networks
- arxiv: https://arxiv.org/abs/1709.05021
- dataset: http://clickbait.crossmobile.info/
ClickBAIT-v2: Training an Object Detector in Real-Time
https://arxiv.org/abs/1803.10358
Papers
Understanding the difficulty of training deep feedforward neural networks
- intro: Xavier initialization
- paper: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
Domain-Adversarial Training of Neural Networks
- arxiv: https://arxiv.org/abs/1505.07818
- paper: http://jmlr.org/papers/v17/15-239.html
- github: https://github.com/pumpikano/tf-dann
Scalable and Sustainable Deep Learning via Randomized Hashing
Training Deep Nets with Sublinear Memory Cost
- arxiv: https://arxiv.org/abs/1604.06174
- github: https://github.com/dmlc/mxnet-memonger
- github: https://github.com/Bihaqo/tf-memonger
Improving the Robustness of Deep Neural Networks via Stability Training
Faster Training of Very Deep Networks Via p-Norm Gates
Fast Training of Convolutional Neural Networks via Kernel Rescaling
FreezeOut: Accelerate Training by Progressively Freezing Layers
Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network Training
- intro: CMU & The University of Iowa
- arxiv: https://arxiv.org/abs/1707.04822
Image Quality Assessment Guided Deep Neural Networks Training
https://arxiv.org/abs/1708.03880
An Effective Training Method For Deep Convolutional Neural Network
- intro: Beijing Institute of Technology & Tsinghua University
- arxiv: https://arxiv.org/abs/1708.01666
On the Importance of Consistency in Training Deep Neural Networks
- intro: University of Maryland & Arizona State University
- arxiv: https://arxiv.org/abs/1708.00631
Solving internal covariate shift in deep learning with linked neurons
- intro: Universitat de Barcelona
- arxiv: https://arxiv.org/abs/1712.02609
- github: https://github.com/blauigris/linked_neurons
Tools
pastalog: Simple, realtime visualization of neural network training performance
torch-pastalog: A Torch interface for pastalog - simple, realtime visualization of neural network training performance
Blogs
Important nuances to train deep learning models
http://www.erogol.com/important-nuances-train-deep-learning-models/
Train your deep model faster and sharper — two novel techniques
https://hackernoon.com/training-your-deep-model-faster-and-sharper-e85076c3b047