{
"cells": [
{
"cell_type": "code",
"execution_count": 30,
"id": "ed2ba8f2-1739-4056-a928-df58282076b3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[NbConvertApp] WARNING | Config option `kernel_spec_manager_class` not recognized by `NbConvertApp`.\n",
"[NbConvertApp] Converting notebook 8_8_6_Exercises.ipynb to markdown\n",
"[NbConvertApp] Writing 9402 bytes to 8_8_6_Exercises.md\n"
]
}
],
"source": [
"!jupyter nbconvert --to markdown 8_8_6_Exercises.ipynb"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "25305ff5-77d3-47fe-805d-e92e63e3ee83",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import torch.nn as nn\n",
"import torch\n",
"import warnings\n",
"sys.path.append('/home/jovyan/work/d2l_solutions/notebooks/exercises/d2l_utils/')\n",
"import d2l\n",
"from torchsummary import summary\n",
"warnings.filterwarnings(\"ignore\")\n",
"\n",
"class Alexnet(d2l.Classifier):\n",
" def __init__(self,lr=0.1, num_classes=10):\n",
" super().__init__()\n",
" self.save_hyperparameters()\n",
" self.net = nn.Sequential(nn.LazyConv2d(96, kernel_size=11, stride=4, padding=1),\n",
" nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2),\n",
" nn.LazyConv2d(256, kernel_size=5, padding=2),\n",
" nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2),\n",
" nn.LazyConv2d(384, kernel_size=3, padding=1),nn.ReLU(),\n",
" nn.LazyConv2d(384, kernel_size=3, padding=1),nn.ReLU(),\n",
" nn.LazyConv2d(256, kernel_size=3, padding=1),nn.ReLU(),\n",
" nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),\n",
" nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),\n",
" nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),\n",
" nn.LazyLinear(num_classes)\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "b908c2c9-9fdb-40f5-9df9-211a8e1c15dd",
"metadata": {},
"source": [
"# 1. Following up on the discussion above, analyze the computational properties of AlexNet.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "68b9a42a-90da-4c0d-a314-6a8b112f730c",
"metadata": {},
"source": [
"## 1.1 Compute the memory footprint for convolutions and fully connected layers, respectively. Which one dominates?\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "a9315196-9107-454f-b0bd-bec6e7e7c089",
"metadata": {},
"source": [
"Fomula of the number of parameters of convolutions is $\\sum^{layers}(c_i*c_o*k_h*k_w+c_o)$"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "6d742699-0ff7-488b-952c-90d59898f4c0",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"3747200"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"3*96*11*11+96+96*256*5*5+256+256*384*3*3+384+384*384*3*3+384+384*256*3*3+256"
]
},
{
"cell_type": "markdown",
"id": "c3c9f498-552b-461a-9b96-afc954bbc68e",
"metadata": {},
"source": [
"Fomula of the number of parameters of fully connected is $\\sum^{layers}(x_i*x_o+x_o)$"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "6d1c3d49-3031-4941-9ccb-2b3a7a2c4fdb",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"43040778"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"80*80*4096+4096+4096*4096+4096+4096*10+10"
]
},
{
"cell_type": "markdown",
"id": "8b70c015-82c6-407a-a99c-266fbdd452eb",
"metadata": {},
"source": [
"The **fully connected layers** dominates"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "845095a8-e19a-4782-9c17-47d9c1ae37cb",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'conv': 3747200, 'lr': 43040778}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = Alexnet()\n",
"X = torch.randn(1,3, 224, 224)\n",
"_ = model(X)\n",
"params = {'conv':0, 'lr':0}\n",
"for idx, module in enumerate(model.net):\n",
" if type(module) not in (nn.Linear,nn.Conv2d):\n",
" continue\n",
" num = sum(p.numel() for p in module.parameters())\n",
" # print(f\"Module {idx + 1}: {num} parameters type:{type(module)}\")\n",
" if type(module) == nn.Conv2d:\n",
" params['conv'] += num\n",
" \n",
" else:\n",
" params['lr'] += num\n",
"\n",
"params"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "aa84e760-166e-4c6b-b572-7fa99ae25dbf",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"----------------------------------------------------------------\n",
" Layer (type) Output Shape Param #\n",
"================================================================\n",
" Conv2d-1 [-1, 96, 54, 54] 34,944\n",
" ReLU-2 [-1, 96, 54, 54] 0\n",
" MaxPool2d-3 [-1, 96, 26, 26] 0\n",
" Conv2d-4 [-1, 256, 26, 26] 614,656\n",
" ReLU-5 [-1, 256, 26, 26] 0\n",
" MaxPool2d-6 [-1, 256, 12, 12] 0\n",
" Conv2d-7 [-1, 384, 12, 12] 885,120\n",
" ReLU-8 [-1, 384, 12, 12] 0\n",
" Conv2d-9 [-1, 384, 12, 12] 1,327,488\n",
" ReLU-10 [-1, 384, 12, 12] 0\n",
" Conv2d-11 [-1, 256, 12, 12] 884,992\n",
" ReLU-12 [-1, 256, 12, 12] 0\n",
" MaxPool2d-13 [-1, 256, 5, 5] 0\n",
" Flatten-14 [-1, 6400] 0\n",
" Linear-15 [-1, 4096] 26,218,496\n",
" ReLU-16 [-1, 4096] 0\n",
" Dropout-17 [-1, 4096] 0\n",
" Linear-18 [-1, 4096] 16,781,312\n",
" ReLU-19 [-1, 4096] 0\n",
" Dropout-20 [-1, 4096] 0\n",
" Linear-21 [-1, 10] 40,970\n",
"================================================================\n",
"Total params: 46,787,978\n",
"Trainable params: 46,787,978\n",
"Non-trainable params: 0\n",
"----------------------------------------------------------------\n",
"Input size (MB): 0.57\n",
"Forward/backward pass size (MB): 10.22\n",
"Params size (MB): 178.48\n",
"Estimated Total Size (MB): 189.28\n",
"----------------------------------------------------------------\n"
]
}
],
"source": [
"summary(model, (3, 224, 224))"
]
},
{
"cell_type": "markdown",
"id": "6057efe5-b929-4358-8a15-b1ad19eb316c",
"metadata": {},
"source": [
"## 1.2 Calculate the computational cost for the convolutions and the fully connected layers."
]
},
{
"cell_type": "markdown",
"id": "27390c9f-97fc-46af-9a62-7ccdaa3809ac",
"metadata": {},
"source": [
"Fomula of the computational cost for convolutions is $\\sum^{layers}(c_i*c_o*k_h*k_w*h_o*w_o)$"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "5a3af9f0-68f9-45a5-bd5d-ee19d0362dfd",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"962858112"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"3*96*11*11*54*54+96*256*5*5*26*26+256*384*3*3*12*12+384*384*3*3*12*12+384*256*3*3*12*12"
]
},
{
"cell_type": "markdown",
"id": "299f6251-3fde-41f2-b6b7-dc4986c0b655",
"metadata": {
"tags": []
},
"source": [
"Fomula of the computational cost for fully connected layers is $\\sum^{layers}(x_i*x_o+x_o)$"
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "09ca9962-357b-4e57-a4e9-e752e9a7a543",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"43040778"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"80*80*4096+4096+4096*4096+4096+4096*10+10"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "a6843479-d402-4d15-8267-a2d8bb2f57f2",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'conv': 962858112, 'lr': 43040778}"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = torch.randn(1,3, 224, 224)\n",
"params = {'conv':0, 'lr':0}\n",
"for idx, module in enumerate(model.net):\n",
" c_i = x.shape[1]\n",
" x = module(x)\n",
" if type(module) == nn.Conv2d:\n",
" k = [p.shape for p in module.parameters()]\n",
" c_o,h_o,w_o = x.shape[1], x.shape[2], x.shape[3]\n",
" params['conv'] += c_i*c_o*h_o*w_o*k[0][-1]*k[0][-2]\n",
" if type(module) == nn.Linear:\n",
" params['lr'] += sum(p.numel() for p in module.parameters())\n",
"params"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "18ec5a97-60e3-4b22-95b9-2eae4e792648",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total parameters: 46787978\n"
]
}
],
"source": [
"X = torch.randn(1,3, 224, 224)\n",
"_ = model(X)\n",
"total_params = sum(p.numel() for p in model.parameters())\n",
"print(\"Total parameters:\", total_params)"
]
},
{
"cell_type": "markdown",
"id": "0b2be978-176c-49b1-a24a-4925f55984ea",
"metadata": {},
"source": [
"## 1.3 How does the memory (read and write bandwidth, latency, size) affect computation? Is there any difference in its effects for training and inference?"
]
},
{
"cell_type": "markdown",
"id": "cc8dbde6-9721-4ca0-8973-25b2aefd11b7",
"metadata": {},
"source": [
"Memory characteristics, including read and write bandwidth, latency, and size, have a significant impact on computation in both training and inference of neural networks. These factors can influence the overall performance, efficiency, and speed of the computation. Here's how these memory aspects affect computation and any potential differences between training and inference:\n",
"\n",
"**Read and Write Bandwidth**:\n",
"- **Effect**: High read and write bandwidth enables faster data movement between memory and processing units. It allows for efficient data retrieval and storage during computations.\n",
"- **Impact**: Faster data access improves the throughput of computations, reducing the time spent waiting for data to arrive. It can lead to faster training and inference times.\n",
"- **Training vs. Inference**: Both training and inference benefit from high bandwidth, as data movement is a common bottleneck in both phases.\n",
"\n",
"**Latency**:\n",
"- **Effect**: Latency is the time taken to access data from memory. Lower latency results in quicker data access and faster computations.\n",
"- **Impact**: Low-latency memory access reduces the time spent waiting for data to be available for computations, leading to improved processing speed.\n",
"- **Training vs. Inference**: Latency affects both training and inference. In training, frequent weight updates and gradient computations require low-latency access to data. In inference, quick responses are crucial for real-time applications.\n",
"\n",
"**Memory Size**:\n",
"- **Effect**: Memory size determines how much data can be stored. Larger memory allows for more data to be cached, reducing the need for frequent data movement.\n",
"- **Impact**: Sufficient memory size enables efficient caching of data and reduces the need for frequent memory access. It can lead to better performance by avoiding memory bottlenecks.\n",
"- **Training vs. Inference**: Both training and inference benefit from having enough memory to hold intermediate results and data. In training, large memory can accommodate gradients, activations, and model parameters. In inference, it allows for storing intermediate results during feedforward passes.\n",
"\n",
"**Training and Inference Differences**:\n",
"- In training, memory bandwidth is often crucial due to the frequent updates of weights and gradients during backpropagation. Batch processing amplifies memory bandwidth requirements.\n",
"- In inference, low latency and high memory bandwidth are important for real-time applications, but the focus might shift slightly towards optimizing for latency and response time.\n",
"\n",
"**Memory Hierarchy**:\n",
"- Modern architectures often have different memory levels with varying characteristics (e.g., cache, GPU memory, main memory). Optimizing data placement and movement across these levels is crucial for performance.\n",
"- Memory hierarchy can impact both training and inference by affecting how efficiently data is accessed and utilized.\n",
"\n",
"In summary, memory characteristics significantly influence neural network computation. High bandwidth, low latency, sufficient memory size, and efficient memory hierarchy are all essential for achieving optimal performance in both training and inference. While there might be nuances in how these aspects affect training and inference, addressing memory-related bottlenecks is crucial for overall efficiency and speed in deep learning computations."
]
},
{
"cell_type": "markdown",
"id": "0817e088-7115-4b37-8869-2dee9a3ce75b",
"metadata": {},
"source": [
"# 2. You are a chip designer and need to trade off computation and memory bandwidth. For example, a faster chip requires more power and possibly a larger chip area. More memory bandwidth requires more pins and control logic, thus also more area. How do you optimize?\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "9a5bcec9-b107-45bc-878a-fd8556dc449c",
"metadata": {},
"source": [
"Optimizing the trade-off between computation and memory bandwidth in chip design is a complex task that involves careful consideration of various factors. The goal is to achieve a balance between computation speed, memory access efficiency, power consumption, chip size, and other performance metrics. Here's a step-by-step approach to optimizing this trade-off:\n",
"\n",
"1. **Define Performance Goals**:\n",
" - Understand the specific requirements of the target application (e.g., real-time inference, batch training, energy efficiency) and define performance goals in terms of computation speed, memory access speed, and overall system efficiency.\n",
"\n",
"2. **Profile Workloads**:\n",
" - Analyze the workload characteristics of the application. Identify the key computational tasks, memory access patterns, data sizes, and communication frequencies between computation and memory.\n",
"\n",
"3. **Architectural Exploration**:\n",
" - Explore different architectural options to determine the optimal balance between computation units (cores, SIMD units, accelerators) and memory subsystems (cache hierarchy, memory banks).\n",
" - Consider trade-offs such as the number of cores, cache size, cache associativity, and memory hierarchy.\n",
"\n",
"4. **Memory Hierarchy Design**:\n",
" - Design memory hierarchies that provide efficient data access and minimize memory bottlenecks.\n",
" - Decide on cache levels, cache coherence protocols, memory types (SRAM, DRAM), and interconnect architectures.\n",
"\n",
"5. **Power Efficiency**:\n",
" - Optimize power consumption by considering techniques like dynamic voltage and frequency scaling (DVFS), clock gating, power gating, and energy-efficient memory access policies.\n",
"\n",
"6. **Chip Area and Integration**:\n",
" - Balance the chip area between computation and memory components based on performance requirements, available space, and cost constraints.\n",
" - Consider integrating memory close to computational units to reduce memory latency and improve memory bandwidth.\n",
"\n",
"7. **Memory Bandwidth Enhancement**:\n",
" - Explore techniques to enhance memory bandwidth, such as wide memory buses, high-speed memory interfaces (e.g., HBM, GDDR), and efficient memory controller designs.\n",
"\n",
"8. **Parallelism and Pipelining**:\n",
" - Incorporate parallelism and pipelining techniques to overlap computation and memory access, reducing the impact of memory latency on overall performance.\n",
"\n",
"9. **Simulation and Modeling**:\n",
" - Use simulation and modeling tools to evaluate different design choices and configurations.\n",
" - Analyze performance metrics, power consumption, and other relevant parameters for different scenarios.\n",
"\n",
"10. **Feedback Loop**:\n",
" - Iteratively refine the design by gathering feedback from simulations, prototypes, and benchmarks.\n",
" - Fine-tune the trade-offs based on the observed trade-offs between computation and memory bandwidth.\n",
"\n",
"11. **Validation and Testing**:\n",
" - Validate the optimized design through thorough testing and validation across various workloads and usage scenarios.\n",
"\n",
"12. **Real-World Constraints**:\n",
" - Consider real-world constraints such as manufacturing process technology, cost limitations, and time-to-market.\n",
"\n",
"Ultimately, the optimization process involves a careful consideration of performance, power, area, and cost factors. Collaboration among chip architects, designers, and domain experts is essential to make informed decisions and strike the right balance between computation and memory bandwidth."
]
},
{
"cell_type": "markdown",
"id": "23fd2c24-fafe-40dd-ab27-4eb260fbef14",
"metadata": {},
"source": [
"# 3. Why do engineers no longer report performance benchmarks on AlexNet?\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "a4b2db1c-2377-46b2-ae94-0e19b7309300",
"metadata": {},
"source": [
"1. **Aging Benchmark**: AlexNet, while pioneering, was introduced in 2012, and its architecture might not represent the state-of-the-art in terms of efficiency and accuracy compared to more recent models. Newer models, architectures, and techniques have been developed that surpass the performance of AlexNet on various tasks.\n",
"\n",
"2. **Advancements in Architecture**: Over the years, more advanced architectures like VGG, ResNet, Inception, and Transformer-based models (BERT, GPT, etc.) have been developed and have become more popular for benchmarking and research. These architectures often achieve better accuracy and efficiency than AlexNet.\n",
"\n",
"3. **Domain-Specific Models**: Depending on the application domain, engineers might prefer to benchmark models that are tailored to specific tasks. For instance, models like EfficientNet for efficient image classification or object detection networks like Faster R-CNN might be more relevant and commonly used for specific tasks.\n",
"\n",
"4. **Diverse Benchmarks**: With the increase in complexity and diversity of tasks, researchers often benchmark models across a wide range of datasets and tasks. This ensures that the performance of a model is tested across various scenarios rather than just focusing on a single benchmark.\n",
"\n",
"5. **Focus on Real-World Applications**: Engineers are increasingly interested in benchmarking models that demonstrate their performance in real-world applications, such as medical image analysis, autonomous driving, natural language understanding, and more. This shift in focus might result in a move away from using AlexNet.\n",
"\n",
"6. **Evolving Hardware and Software**: Performance benchmarks are often influenced by the underlying hardware (GPUs, TPUs) and software (deep learning frameworks, optimizations). As hardware and software landscapes evolve, engineers tend to benchmark newer models that are optimized for the latest hardware and software technologies.\n",
"\n",
"7. **Research Direction**: The field of deep learning research has expanded, and engineers are exploring various directions such as model interpretability, robustness, fairness, and more. These aspects might take precedence over revisiting older models like AlexNet for benchmarking.\n",
"\n",
"It's important to note that the above points are based on trends and developments up until September 2021. The field of deep learning is rapidly evolving, and practices may have changed since then. If you're looking for the most up-to-date information on performance benchmarks and research trends, I recommend checking recent conference proceedings, research papers, and benchmarks provided by organizations in the field."
]
},
{
"cell_type": "markdown",
"id": "b6be6392-0a50-4e6d-80da-ffb7d9723396",
"metadata": {},
"source": [
"# 4. Try increasing the number of epochs when training AlexNet. Compared with LeNet, how do the results differ? Why?\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "db418dfa-5dce-47e9-8208-6905056fb4de",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"KeyboardInterrupt\n",
"\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
"