This document provides an overview of convolutional neural networks (CNNs) for image and video recognition. It discusses that CNNs have greatly improved image classification accuracy on ImageNet over the years. CNNs consist of convolutional layers that apply filters to extract features, pooling layers that reduce the spatial size, and fully connected layers for classification. Training involves tuning parameters through backpropagation, while inference uses a trained model for classification. Example networks discussed include AlexNet, VGG16, GoogLeNet and ResNet, which contain increasing numbers of parameters and computational operations.