# ILILT: Implicit Learning of Inverse Lithography Technologies

# Haoyu Yang 1 Haoxing Ren 1

# **Abstract**

Lithography, transferring chip design masks to the silicon wafer, is the most important phase in modern semiconductor manufacturing flow. Due to the limitations of lithography systems, Extensive design optimizations are required to tackle the design and silicon mismatch. Inverse lithography technology (ILT) is one of the promising solutions to perform pre-fabrication optimization, termed mask optimization. Because of mask optimization problems' constrained non-convexity, numerical ILT solvers rely heavily on good initialization to avoid getting stuck on sub-optimal solutions. Machine learning (ML) techniques are hence proposed to generate mask initialization for ILT solvers with one-shot inference, targeting faster and better convergence during ILT. This paper addresses the question of whether ML models can directly generate high-quality optimized masks without engaging ILT solvers in the loop. We propose an implicit learning ILT framework: ILILT, which leverages the implicit layer learning method and lithography-conditioned inputs to ground the model. Trained to understand the ILT optimization procedure, ILILT can outperform the state-of-the-art machine learning solutions, significantly improving efficiency and quality.

# 1. Introduction

Lithography is the most important phase in semiconductor manufacturing flow, which transfers chip design patterns onto the silicon wafer. However, due to the limitations of lithography systems, there are usually fetal manufacturing gaps between the design and the patterns on the wafer. Extensive prior fabrication design optimizations, termed *mask optimization*, are required to compensate for the manufacturing loss (Figure 1).

Proceedings of the 41<sup>st</sup> International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).



Figure 1. Limitations of lithography system requires design optimization to be correctly fabricated on the silicon wafer.

Inverse lithography technology (ILT), as a numerical solution, has shown great promise for mask optimization on advanced-node lithography (Gao et al., 2014; Yu et al., 2021; Braam et al., 2019; Pang et al., 2019). As shown in Figure 2(a), an ILT solver requires frequent calls of lithography simulator, which mathematically approximate the lithography fabrication behavior. The forward path computes the wafer image corresponding to the current mask and we measure the error between the wafer image and the original design. In the backward path, the error will be passed back through gradient method to update the mask. However, the non-convexity of mask optimization problems poses a great challenge to numerical ILT solvers, that they rely heavily on good mask initialization and can easily get stuck at sub-optimal solutions.

Recent research into machine learning (ML) techniques for lithography have made ILT solvers more efficient. As shown in Figure 2(b), ML can provide better optimization starting points that significantly reduce ILT iterations and yield better convergence (Yang et al., 2018; Chen et al., 2020; Jiang et al., 2020). GAN-OPC (Yang et al., 2018) is the earliest deep learning solution for inverse lithography, where a conditional generative adversarial network is proposed for initial mask generation with litho-guided pre-training. DAMO (Chen et al., 2020) improves on this work by replacing the standard generator in GAN-OPC with nested UNet and ResNet bottleneck layers (Zhou et al., 2019; Isola et al., 2017; He et al., 2016), yielding a direct output of high-resolution masks. Jiang et al. (Jiang et al., 2020) also integrates the explicit gradient calculation of litho-guided pre-training and achieves better training convergence and mask quality. Very recently, Zhao et al. (Zhao et al., 2022) develops the AdaOPC framework that is able to learn from

<sup>&</sup>lt;sup>1</sup>Design Automation Research Group, NVIDIA, Austin, TX. Correspondence to: Haoyu Yang <a href="mailto:haoyuy@nvidia.com">haoyuy@nvidia.com</a>>.



Figure 2. Working scheme of ILT solvers. (a) Standard ILT solver that iteratively updates the mask from design till convergence. (b) Common ML approach that uses ML to produce an initial mask followed by standard ILT procedure. (c) The ILILT that uses ML to imitate the standard ILT procedure.

layout geometry and identifies whether a pattern can be optimized by the machine learning engine. This further increase the applicability of ML for mask optimization solutions.

However, existing solutions only exploit limited ML in the ILT solver and still require significant intervention by traditional numerical solvers. For example, GAN-OPC and its variations can only learn a design-to-mask mapping, which cannot be used to fix poor-quality masks; AdaOPC (Zhao et al., 2022) has to conduct full numerical optimizations on complicated patterns. In this paper, we try to address the question of whether an ML model can directly generate high quality optimized masks without engaging ILT solvers in the loop. We believe that the key limitation of existing methods is that the trained ML models are not grounded to any actual lithography condition during inference, therefore training-inference distribution shift can not be recognized and corrected. We argue that ML models grounded on lithography estimation of intermediate masks can make an ML-based ILT framework more effective.

We propose the ILILT (Figure 2(c)) framework to formulate mask optimization as an implicit layer learning problem, trained to optimize the mask towards an equilibrium state (Bai et al., 2019). Like a standard ILT solver that iteratively updates masks from a design, ILILT iteratively modifies masks from a design but with much fewer steps and each step runs significantly faster because no gradient computing is required. This process produces a number of intermediate masks which ground the ML model with intermediate wafer images computed by lithography models.

**ILT As An Implicit Layer**. Unlike common neural network layers that explicitly define specific computation graphs supporting forward and backward propagation, an

implicit layer models a joint condition of its input and output (Jittorntrum, 1978). The nature of ILT makes it a representative example of an implicit layer, where the input is a chip design and the output is the corresponding optimized mask that results in a similar wafer pattern compared to the input after the lithography process. Inspired by the implicit layer learning approach, we design ILILT with a weight-tied sequence model, making it possible to learn an optimization trajectory from training design-mask pairs. To make the trajectory learning grounded with physics, we also integrate ILILT with fast lithography estimator that can implicitly apply lithography conditions on top of the optimization procedure. After training, ILILT is able to perform mask optimization on a given design and reaches a stable state in a few iterations. We will later show how ILILT outperforms a state-of-the-art academic ILT engine with technical details and supporting experiments on public chip design benchmarks.

The reminders of the paper are organized as follows: Section 2 covers basic terminologies and related works of mask optimization; Section 3 discusses and analyzes the technique details of the ILILT framework; Section 4 presents the experimental results supporting the effectiveness of ILILT and Section 5 concludes the paper.

#### 2. Preliminaries

#### 2.1. ILT Solver

Mask optimization tries to find a mask design such that the remaining pattern on the silicon wafer after the lithography process is as close as possible to the desired shapes. This can be formulated as a constrained optimization problem:

$$\min_{\boldsymbol{M}} l = d(\boldsymbol{Z}, \boldsymbol{Z}^*), \tag{1}$$

where,

$$Z = f_l(M), (2)$$

where Z and  $Z^*$  represent the wafer image and the chip design, respectively, M corresponds to the mask to be optimized,  $f_l$  represents the lithography process which is highly non-convex (see appendix for more details) and d is a measurement indicating the differences between Z and  $Z^*$ . Inverse lithography techniques (Gao et al., 2014) try to solve Problem (1) in a numerical way by repeating the following steps until convergence:

1. Forward:  $Z_t = f_l(M_t)$ .

2. Backward: 
$$\boldsymbol{M}_{t+1} = \boldsymbol{M}_t - \lambda \frac{\partial l}{\partial \boldsymbol{Z}_t} \frac{\partial \boldsymbol{Z}_t}{\partial \boldsymbol{M}_t}$$
.

where  $\lambda$  is some hyper-parameter that controls how the mask image is updated during the optimization procedure. There

are also variations of ILTs (Yu et al., 2021; Jiang et al., 2020; Gao et al., 2014), but they mostly share similar mask-update concepts.

#### 2.2. Implicit Layers

Most neural network layers or computational graphs explicitly define a function y = f(x), which is differentiable and can be back-propagated via the chain rule. An implicit layer, instead of specifying how to compute the layer's output from the input, specifies the conditions that we want the layer's output to satisfy, separating the procedure to compute the layer from the layer definition itself. It has been used in deep equilibrium models, Neural ODEs, and differentiable optimizations.

**Definition 2.1** (Implicit Layer). Implicit layers try to find y such that  $f(x, y) \in \mathcal{C}$ , which defines joint conditions between x and y (Jittorntrum, 1978).

A fixed-point layer is an example of the implicit layer method, which directly computes the fixed point of a nonlinear transformation, i.e., the solution to the nonlinear system which models the convergent state of a recurrent sequence.

**Definition 2.2** (Fixed-Point Layer). A fixed-point layer can be mathematically given by

$$y = g(x, y). (3)$$

A fixed-point layer can be considered to be an infinite number of layers parameterized by the same weight w (weighttied), where each layer can be represented as

$$\boldsymbol{y}_{t+1} = g(\boldsymbol{x}, \boldsymbol{y}_t, \boldsymbol{w}), \tag{4}$$

To train a fixed-point layer, (Bai et al., 2019) proposes an unrolling-free deep equilibrium (DEQ) method, which trains the fixed-point layer in two alternative steps: (1) solve for the fixed-point of the layer and (2) update network parameters. This process consists of repeated forward and backward computations as follows. The forward path is related to the finding of the fixed point of Equation (3) and can be solved with Newton's method. Let

$$h(\boldsymbol{x}, \boldsymbol{y}^*, \boldsymbol{w}) = g(\boldsymbol{x}, \boldsymbol{y}^*, \boldsymbol{w}) - \boldsymbol{y}^*, \tag{5}$$

and  $y^*$  can be found through the following step until convergence:

$$\mathbf{y}_{t+1} = \mathbf{y}_t - \lambda \mathbf{J}_t^{-1} h(\mathbf{x}, \mathbf{y}_t, \mathbf{w}), \tag{6}$$

where  $J_t$  is the Jacobian matrix of h at  $y_t$  and  $\lambda$  is the update rate.

The backward process covers the update of w, which can be written as

$$w = w - \lambda \frac{\partial l}{\partial w} = w + \lambda \frac{\partial l}{\partial u^*} J^{*-1} \frac{\partial g}{\partial w},$$
 (7)

where  $J^*$  is the Jacobian matrix of h at  $y^*$ , l is a loss function that measures whether  $f(x,y) \in \mathcal{C}$  satisfies the implicit layer definition, e.g. the ILT loss Equation (1). For a detailed proof of Equation (7), we refer readers to (Bai et al., 2019).

# 2.3. Neural Networks for ML-based Computational Lithography

There are two major groups of network architectures used in lithography applications: (1) The UNet family (Ronneberger et al., 2015; Zhou et al., 2019) and (2) The FNO family (Li et al., 2021). UNets consist of various bypass connections between convolution layer outputs. Representative works include TEMPO (Ye et al., 2020), LithoGAN (Ye et al., 2019), DAMO (Chen et al., 2020) and GAN-OPC (Yang et al., 2018). FNOs contain one or more Fourier Neural Operators along with auxiliary convolution layers. Representative works are DOINN (Yang et al., 2022b) and CFNO (Yang et al., 2022a). UNet-based models usually rely on large model capacity to learn training data distribution while FNO-based models focus on global information learning through frequency domain embedding.

#### 3. The ILILT Framework

#### 3.1. ILT Layer

Following the definition of implicit layer learning, we can solve the ILT equations with an implicit ILT layer defined as finding  $\boldsymbol{M}$  such that

$$f(\boldsymbol{M}, \boldsymbol{Z}^*) = d(f_l(\boldsymbol{M}), \boldsymbol{Z}^*) = \epsilon, \tag{8}$$

where  $\epsilon$  is some minimum value achievable for Equation (1).

Because Equation (8) is a composite of a group of non-linear functions, it is difficult to solve explicitly. Therefore, we assume Equation (8) has a solution  $\boldsymbol{M}^*$  that follows the form of

$$\boldsymbol{M}^* = q(\boldsymbol{M}^*, \boldsymbol{Z}^*), \tag{9}$$

which yields a fixed-point layer representation of the ILT solution.

We start solving Equation (9) from the weight-tied approach, where we define

$$M_{t+1} = g(M_t, Z^*, w), t = 1, 2, ...,$$
 (10)

and

$$\lim_{t \to \infty} \boldsymbol{M}_t = \lim_{t \to \infty} g(\boldsymbol{M}_t, \boldsymbol{Z}^*, \boldsymbol{w}) = g(\boldsymbol{M}^*, \boldsymbol{Z}^*, \boldsymbol{w}) = \boldsymbol{M}^*.$$
(11)

#### 3.2. Lithography Estimation

As we discussed previously, due to the lack of prior lithography knowledge, state-of-art AI approaches have been lim-

ited to only generate initialization for numerical ILT solvers. This motivates us to build lithography estimation inside g to ground the AI model.

We rewrite Equation (10) as:

$$M_{t+1} = g(M_t, Z_t, Z^*, w), t = 1, 2, ...,$$
 (12)

where  $Z_t = f_l(M_t)$  is the wafer image of  $M_t$  at time stamp t. Equation (12) formulates the core of the ILILT framework. A virtue of Equation (12) is the embedded condition of the lithography estimator that implicitly tells g the physical meaning of each step.

#### 3.3. Training

Equation (12) defines a weight-tied fixed-point model that can be solved with the DEQ method. Although DEQ (Bai et al., 2019) reduces the computing overhead of sequence unrolling and back-propagation through time, there are several concerns when it's used for solving ILT problems:

- Solving M\* through Newton's method or other Root-Finding algorithms is still computationally costly considering that M\* usually has millions of entries.
- The solution error caused by the non-optimality in Newton's method will further propagate to the gradient calculation in backward DEQ, inducing additional errors.
- The DEQ backward pass still requires the computation
  of the gradient over lithography modeling in the term
   \[
  \frac{\partial d}{\partial M^\*}, \] showing no advantage over a traditional ILT solver.

To address these concerns, we developed the following training architecture that solves the ILT layer efficiently and effectively. We unroll ILILT to a fixed depth T and allow the final output  $M_T$  to be trained towards a ground truth mask  $M^*$ ,

$$\min_{\mathbf{w}} l = \sum_{t=\left[\frac{T}{2}\right]}^{T} \exp\left[\frac{t}{T} - 1\right] \cdot ||\mathbf{M}_{t} - \mathbf{M}^{*}||_{F}^{2},$$
 (13)

s.t. 
$$M_{t+1} = g(M_t, Z_t, Z^*, w), t = 0, 1, ..., T - 1,$$
  
 $Z_t = f_l(M_t), t = 0, 1, ..., T,$ 

where t indicates each unrolling step.

It should be noted that to encourage the mask to grow towards and stop at the golden solution, we did not adopt the cost function that directly minimizes the difference between  $M_T$  and  $M^*$ . Instead, we ask intermediate masks in later unrolling stages to have a trend growing toward the ground truth, which is reflected in Equation (13) having masks at

each time step  $M_t$  trained towards the ground truth  $M^*$ . The coefficient  $\exp[\frac{t}{T}-1]$  applied at each loss term is to let the training process be aware of the progressive growth of masks at each time step. This trick allows the mask to grow smoothly in these steps while stopping at the fixed point. The exponential term is empirically chosen to regularize the mask-growing trends which are consistent with the behavior of the numerical ILT solver that only minor changes are observed in later optimization iterations. Solving Equation (13) is straightforward through back-propagation through time (BPTT).

#### 3.4. Discussion

#### 3.4.1. RELATIONS TO ML-AIDED SOLUTIONS

For the training procedure, we observe that ILILT resembles GAN-OPC (Yang et al., 2018) in terms of the training objectives, where we also force the model to learn a golden mask generated by some conventional methods. However, ILILT advances beyond GAN-OPC works (Yang et al., 2018; Jiang et al., 2020; Zhao et al., 2022; Chen et al., 2020) because it not only learns what a good mask should look like, but also **why and how a good mask is grown** thanks to the implicit lithography condition introduced by  $f_l(\cdot)$ .

#### 3.4.2. RELATIONS TO IMPLICIT LAYER LEARNING

The whole ILILT method is built upon the definition of the ILT layer and how it is solved. Representative solutions, e.g. DEQ (Bai et al., 2019), still requires costly computation of gradients of the objectives over physical models. ILILT makes it **even more implicit** by feeding the wafer image, target design and the mask image simultaneously into  $g(\cdot, \boldsymbol{w})$  at each time stamp. Finally  $g(\cdot, \boldsymbol{w})$  will automatically figure out the inherent relations between masks, designs and resist patterns.

# 3.4.3. RELATIONS TO LEARNING-TO-OPTIMIZE

The ILILT system is a similar scheme to learning-to-optimize (L2O) solutions (Chen et al., 2021), particularly the algorithm unrolling approach, which unrolls a truncated algorithm into multi-stage neural networks with independent parameters. In contrast, ILILT benefits from the **weight-tied structure** that (1) significantly reduces model size for memory efficiency and (2) generalizes better (Bai et al., 2019).

# 3.4.4. PRACTICAL USAGE SCENARIO

Lastly, ILILT imitates a real ILT solver during the inference runtime but with many benefits: (1) ILILT requires to query a lithography simulator in each step but updates the mask in a derivative-free manner. (2) ILILT requires significant less steps than traditional ILT solver. (3) ILILT has the opportunity to outperform numerical ILT solvers due to strong mask







(a) Design

(b) Mask

(c) Wafer

Figure 3. Data example from LithoBench (Zheng et al., 2023). (a) Design image contains the patterns of original circuit devices or connectivity; (b) Mask image is the optimized design that is manufacturing-friendly; (c) Wafer image is the patterns on silicon given the mask image.

priors learned from broad data points. (4) ILILT can serve as a real solver during chip design optimization. One representative case is when a design is not well-optimized by an existing mask optimizer. Because most AI solutions only learn a design-to-mask mapping, it is not possible to use them to fix partially optimized masks. ILILT, however, does not have this limitation, and can take over from anywhere in the optimization trajectory.

# 4. Experiments

#### 4.1. The Dataset and Configurations

To demonstrate the effectiveness of the ILILT framework, we adopt the latest AI for computational lithography benchmark suite LithoBench (Zheng et al., 2023), where stateof-the-art mask optimization solutions are evaluated and will be used as comparison baselines. LithoBench contains over 100K design clips and each covers an area of  $2\mu m \times 2\mu m$ . Each data point contains a pair of chip design patterns and the ground truth masks generated by a numerical ILT solver. These design patterns and masks are rasterized with a  $1nm^2$ pixel size, yielding 2048 × 2048 images. Example designs are displayed in Figure 7. We can observe that in real world, mask optimization does not simply finetune the design like in Figure 1. The process also produce sub-resolution assist features (SRAFs) surrounding the main design pattern to improve its robustness against process variations. This poses great challenge of ML models on optimization tasks.

To deploy ILILT, we re-implement the lithography simulator as a pytorch layer that can be integrated in the data pipeline. Detailed configurations of the generator used to reproduce our results are listed in Table 1. Here we adopt CFNO (Yang et al., 2022a) as the backbone network for g to evaluate ILILT.

#### 4.2. Evaluation Metrics

As is typical for mask optimization tasks, we evaluate the final performance through the metrics that are widely adopted

Table 1. Model Configuration.

| Item                           | g              |  |  |
|--------------------------------|----------------|--|--|
| Backbone                       | CFNO/Pix2PixHD |  |  |
| Max Epoch                      | 5              |  |  |
| Initial Learning Rate          | 0.004          |  |  |
| Learning Rate Decay Policy     | step           |  |  |
| Optimizer                      | Adam           |  |  |
| Weight Decay                   | 0.0001         |  |  |
| Loss                           | Equation (13)  |  |  |
| Sequence Unrolling Depth $(T)$ | 4-8            |  |  |
| Batch Size                     | 4              |  |  |



Figure 4. Lithography edge placement error and process variations. Smaller EPE and PVB area indicates better optimization QoR.

in the chip manufacturing industry: edge placement error (EPE) violations and the process variation band (PVB) area. In semiconductor manufacturing processes, the lithography conditions are not always as we expected due to systematic error and mechanical perturbations, which is the root cause of process variations. Therefore, we expect the design to be robust under different manufacturing conditions. Typically, we call the expected lithography settings and configurations nominal condition, and non-nominal condition will result in larger or smaller patterns on the silicon wafer. The difference between the smallest pattern and the largest pattern will form a process variation band (PVB) as in Figure 4. A good mask optimization result should have a smaller number of EPE violations and a smaller PVB area.

**Definition 4.1** (EPE Violation(Banerjee et al., 2013)). EPE is measured as the geometric distance between the target edge and the lithographic contour printed at the nominal condition. If the EPE measured at a point is greater than a certain tolerance value, we call it an EPE violation. In our experiments, the EPE measuring points are sampled at 40nm interval.

**Definition 4.2** (PVB Area(Banerjee et al., 2013)). This is evaluated by running lithography simulations at different process conditions on the final mask solution. Once run, the total area of the process variation band will be defined as PVB Area.



Figure 5. Visualization of the optimization trajectory beyond the maximum unrolling depth. We use the model trained with unrolling step T=4, and mask stays stable after the 4th step with only minor tuning. (a)–(h) correspond to the generated masks generated from time stamps 1–8. (i)–(p) are the lithography simulated wafer image with their corresponding EPE violation count.



Figure 6. Mask prior learning from ILILT (P2PHD). Due to design pattern variations, the numerical solver sometimes fails to produce the correct solution. In this example, ILT generates rule-violating artifacts and performs feature shrinking excessively, causing the missing of design patterns as in (c). While ILILT can learn from the broader dataset and make correct predictions. For a more complicated case (f)-(j), ILILT can successfully produce optimized mask that can pass lithography printability check while numerical ILT still exhibits EPE violations in the printed wafer image.

# 4.3. Comparison with State-of-the-Art

In the first experiment, we compare the performance of ILILT with state-of-the-art mask optimizers. A quantitative comparison is listed in Table 2. Columns "EPE", "PVB" and "Throughput (s/tile)" indicate the count of EPE violations, PVB Area in terms of  $nm^2$  and optimization throughput, respectively; "GPU-ILT" (Sun et al., 2023) is the latest academic numerical ILT engine with GPU acceleration, which will also be the engine used to generate the training data; "GAN-OPC" (Yang et al., 2018) is one of the early attempts using machine learning for mask optimization following the scheme in Figure 2(b). GAN-OPC employs UNet as a generator backbone. "Pix2PixHD" (Wang et al., 2018) is another representative CNN-based backbone that enjoys better gen-

eralization capability for high-resolution image generation, which fits design-mask translation task well. "Neural-ILT" (Jiang et al., 2020) is a follow-up of GAN-OPC, which adopts a true lithography simulator to produce the gradient for generator training. "CFNO" (Yang et al., 2022a) is an FNO-based design that preserves the global perception of the Fourier neural operator while saving computation with ViT's patch embedding. CFNO and Pix2PixHD will also be the backbone frameworks of the ILILT generator, and we will show the advantage of the ILILT data pipeline over the standalone generator. ILILT-X lists the results for X unrolling steps.

Overall, the ILILT with Pix2PixHD backbone achieves the smallest EPE violation count among all other ML-based

Table 2. Comparison with State-of-the-Arts.

| Method                          | EPE  | PVB  | Throughput |
|---------------------------------|------|------|------------|
| GPU-ILT (Sun et al., 2023)      | 0.21 | 4656 | 4          |
| GAN-OPC (Yang et al., 2018)     | 8.3  | 6686 | 0.008      |
| Neural-ILT (Jiang et al., 2020) | 6.2  | 8537 | 10         |
| Pix2PixHD (Wang et al., 2018)   | 0.12 | 4685 | 0.1        |
| CFNO (Yang et al., 2022a)       | 0.51 | 4623 | 0.06       |
| L2O-CFNO-8-Step                 | 1.59 | 4436 | 0.6        |
| L2O-P2PHD-8-Step                | 0.87 | 4526 | 0.5        |
| ILILT-CFNO-2-130K (ours)        | 0.39 | 4704 | 0.1        |
| ILILT-CFNO-4-130K (ours)        | 0.39 | 4667 | 0.2        |
| ILILT-CFNO-6-130K (ours)        | 0.35 | 4657 | 0.3        |
| ILILT-CFNO-8-130K (ours)        | 0.25 | 4674 | 0.4        |
| ILILT-P2PHD-2-45M (ours)        | 0.13 | 5056 | 0.2        |
| ILILT-P2PHD-4-45M (ours)        | 0.13 | 4680 | 0.4        |
| ILILT-P2PHD-6-45M (ours)        | 0.11 | 4670 | 0.6        |
| ILILT-P2PHD-8-45M (ours)        | 0.08 | 4695 | 0.8        |

solutions with 0.08 average EPE violations (achieved by T=8), which outperforms the golden numerical solver. ILILT-CFNO though behaves slightly worse than ILILT-P2PHD, it can still exhibit comparable results with "GPU-ILT" with a  $400\times$  smaller model size. All methods do not show a significant advantage on PVB area, and this is because the optimization flow in these solutions do not explicitly target process variations. Lastly, the ILILT already achieves better mask quality than the other three ML-based solutions without the refinement from the legacy ILT engine. We also show a  $10\times$  speedup over the numerical ILT solver.

The ILILT framework also ensembles the beneath idea of learning-to-optimize (L2O), which unrolls optimization steps on multiple layers of neural networks whose weights are not shared. As a comparison, we also implement two standard L2O flows with CFNO and Pix2PixHD as backbones respectively. We choose the 8 unrolling steps that have similar computing costs to the best ILILT settings. As shown in Table 2, L2O does not preserve the performance of standalone backbone models with a significant increase in EPE violations. L2O unrolls the ILT problem into a fixed number of optimization steps, which are parameterized by different neural network layers. Such a model, with 8× trainable parameters in our case, requires more converging training efforts. Specifically, the error of each optimization step will propagate to the next level which causes performance degradation compared to standalone backbone models.

#### 4.4. Generalization of ILILT

ILILT also shows better generalization capability thanks to its weight-tied structure. This can be observed in the superior performance boosting over standalone CNN and Neural Operator-based backbone models. We can observe that with the weight-tied sequence training on CFNO and Pix2PixHD, the mask quality has an improvement of 50% and 33% reduced EPE violation. Although the unrolled sequence induces additional runtime, the model inference time can be ignored when compared to the entire physical implementation and manufacturing flow.

# 4.5. Self-learning Capability

Usually, ML-based solutions rely highly on the dataset quality, and hence the performance of traditional ILT solvers. However, ILT is naturally non-convex and suffers from suboptimal solutions. Under such a scenario, ILILT can take advantage of its capability to outperform numerical solvers in certain chip designs, providing faster iteration for dataset updates and serving as a powerful auxiliary candidate for traditional numerical solvers.

# 4.6. ILILT Optimization Flow

The learned mask optimization process is depicted in Figure 5, where we visualize the growth of a mask along the unrolled sequence for a given design. We bring the model trained with unrolling step T=4 but keep the loop running until T=8. We show that the mask reaches the best quality at the configured unrolling depth. Interestingly, the ILILT also learns a fixed-point iteration such that if we continuously feed the mask beyond the unrolling depth, the model will no longer make significant changes to the mask except minor adjustments. At T=8, the ILILT engine tries to remove the rule-violating artifact from the middle-bottom region. This property brings more application scenarios of the ILILT framework that include producing initial solutions and performing fine-grained fixes at the post-optimization stage. Another benefit of ILILT is its stable optimization trajectory, because quality fluctuation is common during the optimization runtime of numerical ILT solvers (Gao et al., 2014; Ma et al., 2017; Yu et al., 2021; Sun et al., 2023), causing additional overhead to identify good solutions.

## 4.7. Investigation of Unrolling Depth

Here we discuss the effects of the maximum unrolling depth. The ILILT is trained using different max unrolling depths (T) with  $T \in \{2,4,6,8\}$ . With PVB area being similar, we observe better results happen in larger unrolling steps in the configuration. When T is small, ILILT does not benefit too much from the weight-tied structure and exhibits larger

EPE violation counts. Instead, a larger T will put stronger regularization on the model and enable temporal feature learning to increase the model generality. However, a longer unrolling sequence will linearly increase the computing cost as well as training efforts. In this paper, we experimentally show the performance boosting saturates at T=8 for a fixed number of training epochs.

## 4.8. Mask Prior Learning

In our experiments, we also observe an interesting behavior of ILILT. For some cases, ILILT can even produce better QoR than numerical solution. An example is shown in Figure 6, where the ILILT produces mask solution with zero EPE violations while numerical solver comes with four. This is because ILILT not only learns the mask optimization trajectory from the golden numerical solutions but also captures the mask solution from the broader dataset, which makes the trained network equip deep mask priors. This can also be observed in the generated mask images that ILILT avoids generating rule-violation isolated artifacts while numerical ILT does (see small white dots at the bottom of Figure 6(g)). The result can also be explained through deep image prior (Ulyanov et al., 2018), that during training neural networks capture the structural knowledge with priority. ILILT strengthens this prior through weigh-tied unrolling and lithography guidance.

#### 5. Conclusion

In this paper, we propose a novel solution to the mask optimization problem with inverse lithography technology. We argue that existing AI-based approaches are overly reliant on conventional numerical solvers and lack forward lithography conditions, limiting the applicability of AI-based approaches to mask optimization scenarios. To further exploit the potential of AI and machine learning, we carefully analyze the characteristics of the ILT procedure and reformulate it as an implicit layer learning problem. Instead of applying the end-to-end DEQ method, we solve the ILT layer through ILILT, a weight-tied model, where a forward lithography estimator provides lithography information in the optimization procedure. Experiments demonstrate the effectiveness of ILILT learning an ILT layer with limited intervention of a traditional numerical solver. To the best of our knowledge, this is the first time an AI solution has been proposed to imitate the working scheme of a traditional ILT system. We hope this work can further motivate research into expanding the potential of AI for solving complex design automation problems. Additional investigations on production-level designs are necessary to prototype the framework to study whether QoR meets manufacturing standards.

# **Impact Statement**

Lithography is the very first topic in chip design community brought into applied deep learning. Because design layouts, masks and wafer patterns are naturally in the format of images which can benefit from many backbones and algorithms, though domain adaptions are necessary. Early efforts only fall into forward predictive tasks like hotspot detection or lithography simulation. There is limited research dealing with chip optimization problems that can directly benefit semiconductor manufacturing. This work, ILILT, we believe is the first effort to solve inverse lithography/mask optimization directly using machine learning without the intervention of traditional numerical solvers. Hopefully, this can motivate new research strategies on how is AI included in chip design and manufacturing flow, and the advancement of chip design can feedback and benefit AI in the long run.

#### References

- Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. In *Conference on Neural Information Processing Systems (NIPS)*, 2019.
- Banerjee, S., Li, Z., and Nassif, S. R. ICCAD-2013 CAD contest in mask optimization and benchmark suite. In *IEEE/ACM International Conference on Computer-Aided Design (ICCAD)*, pp. 271–274, 2013.
- Braam, K., Selinidis, K., Hoppe, W., Cai, H., and Xiao, G. EUV mask synthesis with rigorous ILT for process window improvement. In *Proceedings of SPIE*, volume 10962, 2019.
- Chen, G., Chen, W., Ma, Y., Yang, H., and Yu, B. DAMO: Deep agile mask optimization for full chip scale. In *IEEE/ACM International Conference on Computer-Aided Design (ICCAD)*, 2020.
- Chen, T., Chen, X., Chen, W., Heaton, H., Liu, J., Wang, Z., and Yin, W. Learning to optimize: A primer and a benchmark. *arXiv preprint arXiv:2103.12828*, 2021.
- Gao, J.-R., Xu, X., Yu, B., and Pan, D. Z. MOSAIC: Mask optimizing solution with process window aware inverse correction. In ACM/IEEE Design Automation Conference (DAC), pp. 52:1–52:6, 2014.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, pp. 770–778, 2016.
- Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-toimage translation with conditional adversarial networks. In *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, pp. 1125–1134, 2017.

- Jiang, B., Liu, L., Ma, Y., Zhang, H., Young, E. F. Y., and Yu, B. Neural-ILT: Migrating ILT to nerual networks for mask printability and complexity co-optimizaton". In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2020.
- Jittorntrum, K. An implicit function theorem. *Journal of Optimization Theory and Applications*, 25(4):575–577, 1978.
- Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., and Anandkumar, A. Fourier neural operator for parametric partial differential equations. *International Conference on Learning Representations* (ICLR), 2021.
- Ma, Y., Gao, J.-R., Kuang, J., Miao, J., and Yu, B. A unified framework for simultaneous layout decomposition and mask optimization. In *IEEE/ACM International Conference on Computer-Aided Design (ICCAD)*, pp. 81–88, 2017.
- Pang, L., Russell, E. V., Baggenstoss, B., Lee, M., Digaum, J., Yang, M.-C., Ungar, P. J., Bouaricha, A., Wang, K., Su, B., et al. Study of mask and wafer co-design that utilizes a new extreme simd approach to computing in memory manufacturing: full-chip curvilinear ilt in a day. In *Proceedings of SPIE*, volume 11148, pp. 136–151, 2019.
- Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In *International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)*, pp. 234–241, 2015.
- Sun, S., Yang, F., Yu, B., Shang, L., and Zeng, X. Efficient ilt via multi-level lithography simulation. In 2023 60th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE, 2023.
- Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, pp. 9446–9454, 2018.
- Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional gans. In *IEEE Conference on Computer Vision and Pattern Recognition* (*CVPR*), pp. 8798–8807, 2018.
- Yang, H., Li, S., Ma, Y., Yu, B., and Young, E. F. GAN-OPC: Mask optimization with lithography-guided generative adversarial nets. In *ACM/IEEE Design Automation Conference (DAC)*, pp. 131:1–131:6, 2018.

- Yang, H., Li, Z., Sastry, K., Mukhopadhyay, S., Anand-kumar, A., Khailany, B., Singh, V., and Ren, H. Large scale mask optimization via convolutional fourier neural operator and litho-guided self training. *arXiv preprint arXiv:2207.04056*, 2022a.
- Yang, H., Li, Z., Sastry, K., Mukhopadhyay, S., Kilgard, M., Anandkumar, A., Khailany, B., Singh, V., and Ren, H. Generic lithography modeling with dual-band opticsinspired neural networks. In ACM/IEEE Design Automation Conference (DAC), 2022b.
- Ye, W., Alawieh, M. B., Lin, Y., and Pan, D. Z. LithoGAN: End-to-end lithography modeling with generative adversarial networks. In *ACM/IEEE Design Automation Conference (DAC)*, pp. 107:1–107:6, 2019.
- Ye, W., Alawieh, M. B., Watanabe, Y., Nojima, S., Lin, Y., and Pan, D. Z. Tempo: Fast mask topography effect modeling with deep learning. In ACM International Symposium on Physical Design (ISPD), pp. 127–134, 2020.
- Yu, Z., Chen, G., Ma, Y., and Yu, B. A GPU-enabled level set method for mask optimization. In *IEEE/ACM Proceedings Design*, *Automation and Test in Eurpoe (DATE)*, 2021.
- Zhao, W., Yao, X., Yu, Z., Chen, G., Ma, Y., Yu, B., and Wong, M. D. AdaOPC: A self-adaptive mask optimization framework for real design patterns. In *IEEE/ACM International Conference on Computer-Aided Design (IC-CAD)*, 2022.
- Zheng, S., Yang, H., Zhu, B., Yu, B., and Wong, M. D. Lithobench: Benchmarking ai computational lithography for semiconductor manufacturing. In *Conference on Neural Information Processing Systems (NIPS)*, 2023.
- Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N., and Liang, J. Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. *IEEE transactions on medical imaging*, 39(6):1856–1867, 2019.

# A. Lithography Simulation Model

Throughout the manuscript, we use  $f_l$  to represent the entire lithography process. In detail, this can be broken into two stages: (1) optical modeling and (2) resist modeling. The optical modeling tries to find the illumination intensity after the mask and is projected on the silicon wafer. This process can be mathematically written as,

$$I = \sum_{k=1}^{N} \alpha_k || \boldsymbol{h}_k \otimes \boldsymbol{M} ||^2, \tag{14}$$

where I is the light intensity projected on the silicon wafer, M denotes the mask image, and  $h_k$ 's and  $\alpha_k$ 's are lithography system-related parameters that are calibrated in foundries.  $h_k$ 's and  $\alpha_k$ 's are different under different process conditions.

During resist modeling, I will have a chemical reaction with some "resist materials" and etch chip design on the top of the silicon wafer, which is given by

$$\mathbf{Z}(i,j) = \begin{cases} 1, & \text{if } \mathbf{I}(i,j) > I_{\text{th}}, \\ 0, & \text{otherwise.} \end{cases}$$
 (15)

In this paper,  $I_{\rm th}=0.225$ , which is determined by the technology profile.



Figure 7. Lithography process. (a) Mask image contains the patterns of optimized design; (b) Intensity image is the UV lights that are projected on the silicon wafer given by Equation (14); (c) Wafer image is the patterns on silicon given Equation (15).

# **B. Numerical ILT**

Because numerical solver usually requires differentiable objectives. In practical, we relax  $M \in \{0,1\}$  and  $Z \in \{0,1\}$  through sigmoid.

$$M = \frac{1}{1 + \exp(-\beta_m(M' - 0.5))},\tag{16}$$

where M' is some auxiliary variable that is usually initialized as the original chip design  $Z^*$ .

$$Z = \frac{1}{1 + \exp(-\beta_z(\boldsymbol{I} - I_{\text{th}}))}.$$
 (17)

And now we can solve the following ILT problem through gradient descent.

$$\min_{M,M'} l = ||Z - Z^*||$$
s.t. Equations (14), (16) and (17).

It should be noted that numerical ILT requires frequent calls of time-consuming lithography simulation.

# C. CFNO

CFNO, the convolutional Fourier Neural Operator (Yang et al., 2022a), was built upon the original FNO (Li et al., 2021) design. The capability of FNO learning frequency domain features fits the lithography well. However, the original FNO design has a few challenges: (1) FNO requires computations of FFT of the entire input mask image which is computationally



Figure 8. CFNO backbone design.



Figure 9. The data pipeline of g used throughout experiments.

costly; (2) FNO, once trained, can only be applied on the same-sized image. CFNO addresses these problems by combining the patch embedding in ViT. The core idea is to separate the input into non-overlapped patches, and each patch will go through a shared FNO layer as in Figure 8. Then we perform a token-wise convolution to learn the relations among different patches and finally form the mask embedding. The overall architecture of g is shown in Figure 9.

#### D. Pix2PixHD

Pix2PixHD (Wang et al., 2018) is an improved version of Pix2Pix which employs hierarchical local enhancers for high-resolution image generation. To make the framework fit more on the mask optimization problem we choose  $ngf = 16, n\_downsample\_global = 4, n\_blocks\_global = 9$  as defined in https://github.com/NVIDIA/pix2pixHD. We also make certain improvements by replacing deconvolution layers with bicubic interpolation and stride-1 convolution layers for smooth shape generation. We also apply  $4\times4$  downsampling (achieved by average pooling) at the input directly to reduce computing overhead and perform interpolation to cast the image back to desired resolution. This modification yields 45M model as listed in Section 4.

# E. Additional Visualization



(a) AES-Design



Figure 10. Visualization sample of AES chip design for ILILT and L2O with CNN and Neural Operator backbones. L2O with Pix2PixHD backbone even produces meaningless artifacts.