From d37407b160b755a8816ece475cfff78fad172ac7 Mon Sep 17 00:00:00 2001
From: Piotr Gorski
Date: Mon, 17 Aug 2020 22:48:47 +0200
Subject: [PATCH 01/18] net/sched: allow configuring cake qdisc as default

Signed-off-by: Piotr Gorski
---
 net/sched/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 1e8ab4749..75122fd65 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -474,6 +474,9 @@ choice
 	config DEFAULT_SFQ
 		bool "Stochastic Fair Queue" if NET_SCH_SFQ
 
+	config DEFAULT_CAKE
+		bool "Common Applications Kept Enhanced" if NET_SCH_CAKE
+
 	config DEFAULT_PFIFO_FAST
 		bool "Priority FIFO Fast"
 endchoice
@@ -485,6 +488,7 @@ config DEFAULT_NET_SCH
 	default "fq_codel" if DEFAULT_FQ_CODEL
 	default "fq_pie" if DEFAULT_FQ_PIE
 	default "sfq" if DEFAULT_SFQ
+	default "cake" if DEFAULT_CAKE
 	default "pfifo_fast"
 endif
-- 
2.34.1.75.gabe6bb3905

From 84f3b5a40b50f8008e0a383020abc1724f71ad31 Mon Sep 17 00:00:00 2001
From: "Jan Alexander Steffens (heftig)"
Date: Fri, 26 Oct 2018 11:22:33 +0100
Subject: [PATCH 02/18] infiniband: Fix __read_overflow2 error with -O3
 inlining

---
 drivers/infiniband/core/addr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 65e3e7df8..b41afee77 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -821,6 +821,7 @@ int rdma_addr_find_l2_eth_by_grh(const union ib_gid *sgid,
 	union {
 		struct sockaddr_in _sockaddr_in;
 		struct sockaddr_in6 _sockaddr_in6;
+		struct sockaddr_ib _sockaddr_ib;
 	} sgid_addr, dgid_addr;
 	int ret;
-- 
2.34.1.75.gabe6bb3905

From 77ee331d433f7da2c3cc5c49ad8a433fdde31c63 Mon Sep 17 00:00:00 2001
From: Mark Weiman
Date: Sun, 12 Aug 2018 11:36:21 -0400
Subject: [PATCH 03/18] pci: Enable overrides for missing ACS capabilities

This is an updated version of Alex Williamson's patch from:
https://lkml.org/lkml/2013/5/30/513

Original commit message follows:

PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that allows
us to control whether transactions are allowed to be redirected in
various subnodes of a PCIe topology. For instance, if two endpoints are
below a root port or downstream switch port, the downstream port may
optionally redirect transactions between the devices, bypassing
upstream devices. The same can happen internally on multifunction
devices. The transaction may never be visible to the upstream devices.

One upstream device that we particularly care about is the IOMMU. If a
redirection occurs in the topology below the IOMMU, then the IOMMU
cannot provide isolation between devices. This is why the PCIe spec
encourages topologies to include ACS support. Without it, we have to
assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.

Unfortunately, far too many topologies do not support ACS to make this
a steadfast requirement. Even the latest chipsets from Intel are only
sporadically supporting ACS. We have trouble getting interconnect
vendors to include the PCIe spec required PCIe capability, let alone
suggested features.

Therefore, we need to add some flexibility. The pcie_acs_override=
boot option lets users opt in specific devices or sets of devices to
assume ACS support. The "downstream" option assumes full ACS support
on root ports and downstream switch ports. The "multifunction" option
assumes the subset of ACS features available on multifunction
endpoints and upstream switch ports are supported. The "id:nnnn:nnnn"
option enables ACS support on devices matching the provided vendor and
device IDs, allowing more strategic ACS overrides. These options may
be combined in any order. A maximum of 16 id-specific overrides are
available.
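For illustration, a boot line combining all three forms might look
like the following (the vendor/device ID shown is a placeholder, not a
real device):

	pcie_acs_override=downstream,multifunction,id:1234:abcd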
The "id:nnnn:nnnn" option enables ACS support on devices matching the provided vendor and device IDs, allowing more strategic ACS overrides. These options may be combined in any order. A maximum of 16 id specific overrides are available. It's suggested to use the most limited set of options necessary to avoid completely disabling ACS across the topology. Note to hardware vendors, we have facilities to permanently quirk specific devices which enforce isolation but not provide an ACS capability. Please contact me to have your devices added and save your customers the hassle of this boot option. Signed-off-by: Mark Weiman Signed-off-by: Alexandre Frade --- .../admin-guide/kernel-parameters.txt | 9 ++ drivers/pci/quirks.c | 102 ++++++++++++++++++ 2 files changed, 111 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 1396fd2d9..7f56c8eaa 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3892,6 +3892,15 @@ nomsi [MSI] If the PCI_MSI kernel config parameter is enabled, this kernel boot option can be used to disable the use of MSI interrupts system-wide. + pcie_acs_override = + [PCIE] Override missing PCIe ACS support for: + downstream + All downstream ports - full ACS capabilities + multifunction + All multifunction devices - multifunction ACS subset + id:nnnn:nnnn + Specific device - full ACS capabilities + Specified as vid:did (vendor/device ID) in hex noioapicquirk [APIC] Disable all boot interrupt quirks. Safety option to keep boot IRQs enabled. This should never be necessary. diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 208fa03ac..f78ed6e63 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -3600,6 +3600,106 @@ static void quirk_nvidia_no_bus_reset(struct pci_dev *dev) DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, quirk_nvidia_no_bus_reset); +static bool acs_on_downstream; +static bool acs_on_multifunction; + +#define NUM_ACS_IDS 16 +struct acs_on_id { + unsigned short vendor; + unsigned short device; +}; +static struct acs_on_id acs_on_ids[NUM_ACS_IDS]; +static u8 max_acs_id; + +static __init int pcie_acs_override_setup(char *p) +{ + if (!p) + return -EINVAL; + + while (*p) { + if (!strncmp(p, "downstream", 10)) + acs_on_downstream = true; + if (!strncmp(p, "multifunction", 13)) + acs_on_multifunction = true; + if (!strncmp(p, "id:", 3)) { + char opt[5]; + int ret; + long val; + + if (max_acs_id >= NUM_ACS_IDS - 1) { + pr_warn("Out of PCIe ACS override slots (%d)\n", + NUM_ACS_IDS); + goto next; + } + + p += 3; + snprintf(opt, 5, "%s", p); + ret = kstrtol(opt, 16, &val); + if (ret) { + pr_warn("PCIe ACS ID parse error %d\n", ret); + goto next; + } + acs_on_ids[max_acs_id].vendor = val; + + p += strcspn(p, ":"); + if (*p != ':') { + pr_warn("PCIe ACS invalid ID\n"); + goto next; + } + + p++; + snprintf(opt, 5, "%s", p); + ret = kstrtol(opt, 16, &val); + if (ret) { + pr_warn("PCIe ACS ID parse error %d\n", ret); + goto next; + } + acs_on_ids[max_acs_id].device = val; + max_acs_id++; + } +next: + p += strcspn(p, ","); + if (*p == ',') + p++; + } + + if (acs_on_downstream || acs_on_multifunction || max_acs_id) + pr_warn("Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA\n"); + + return 0; +} +early_param("pcie_acs_override", pcie_acs_override_setup); + +static int pcie_acs_overrides(struct pci_dev *dev, u16 acs_flags) +{ + int i; + + /* Never override ACS for legacy 
+	/* Never override ACS for legacy devices or devices with ACS caps */
+	if (!pci_is_pcie(dev) ||
+	    pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS))
+		return -ENOTTY;
+
+	for (i = 0; i < max_acs_id; i++)
+		if (acs_on_ids[i].vendor == dev->vendor &&
+		    acs_on_ids[i].device == dev->device)
+			return 1;
+
+	switch (pci_pcie_type(dev)) {
+	case PCI_EXP_TYPE_DOWNSTREAM:
+	case PCI_EXP_TYPE_ROOT_PORT:
+		if (acs_on_downstream)
+			return 1;
+		break;
+	case PCI_EXP_TYPE_ENDPOINT:
+	case PCI_EXP_TYPE_UPSTREAM:
+	case PCI_EXP_TYPE_LEG_END:
+	case PCI_EXP_TYPE_RC_END:
+		if (acs_on_multifunction && dev->multifunction)
+			return 1;
+	}
+
+	return -ENOTTY;
+}
 /*
  * Some Atheros AR9xxx and QCA988x chips do not behave after a bus reset.
  * The device will throw a Link Down error on AER-capable systems and
@@ -4950,6 +5050,8 @@ static const struct pci_dev_acs_enabled {
 	{ PCI_VENDOR_ID_NXP, 0x8d9b, pci_quirk_nxp_rp_acs },
 	/* Zhaoxin Root/Downstream Ports */
 	{ PCI_VENDOR_ID_ZHAOXIN, PCI_ANY_ID, pci_quirk_zhaoxin_pcie_ports_acs },
+	/* PCIe ACS overrides */
+	{ PCI_ANY_ID, PCI_ANY_ID, pcie_acs_overrides },
 	{ 0 }
 };
-- 
2.34.1.75.gabe6bb3905

From c33edd0e6d437e62304c1e5ed2b29ebbaf81e20c Mon Sep 17 00:00:00 2001
From: "Martin K. Petersen"
Date: Sun, 22 Mar 2020 16:57:06 -0400
Subject: [PATCH 04/18] scsi: sd: Optimal I/O size should be a multiple of
 reported granularity

Commit a83da8a4509d ("scsi: sd: Optimal I/O size should be a multiple
of physical block size") validated the reported optimal I/O size
against the physical block size to overcome problems with devices
reporting nonsensical transfer sizes. However, some devices claim
conformity to older SCSI versions that predate the physical block size
being reported. Other devices do not report a physical block size at
all. We need to be able to validate the optimal I/O size on those
devices as well.

Many devices report an OPTIMAL TRANSFER LENGTH GRANULARITY in the same
VPD page as the OPTIMAL TRANSFER LENGTH. Use this value to validate
the optimal I/O size. Also check that the reported granularity is a
multiple of the physical block size, if supported.
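As a sketch of the relationship being enforced (illustrative helper,
not from the patch; like the hunks below it uses a mask test, which
catches non-multiples when the granularity is a power of two):

	/* true iff bytes is a multiple of a power-of-two granularity */
	static bool xfer_aligned(unsigned int bytes, unsigned int granularity)
	{
		return granularity && !(bytes & (granularity - 1));
	}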
Link: https://lore.kernel.org/r/33fb522e-4f61-1b76-914f-c9e6a3553c9b@gmail.com
Reported-by: Bernhard Sulzer
Signed-off-by: Martin K. Petersen
---
 drivers/scsi/sd.c | 43 +++++++++++++++++++++++++++++++++++++++----
 drivers/scsi/sd.h |  1 +
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 78ead3369..921b39385 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2907,7 +2907,6 @@ static void sd_read_app_tag_own(struct scsi_disk *sdkp, unsigned char *buffer)
  */
 static void sd_read_block_limits(struct scsi_disk *sdkp)
 {
-	unsigned int sector_sz = sdkp->device->sector_size;
 	const int vpd_len = 64;
 	unsigned char *buffer = kmalloc(vpd_len, GFP_KERNEL);
 
@@ -2916,9 +2915,7 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 	    scsi_get_vpd_page(sdkp->device, 0xb0, buffer, vpd_len))
 		goto out;
 
-	blk_queue_io_min(sdkp->disk->queue,
-			 get_unaligned_be16(&buffer[6]) * sector_sz);
-
+	sdkp->min_xfer_blocks = get_unaligned_be16(&buffer[6]);
 	sdkp->max_xfer_blocks = get_unaligned_be32(&buffer[8]);
 	sdkp->opt_xfer_blocks = get_unaligned_be32(&buffer[12]);
 
@@ -3094,6 +3091,29 @@ static void sd_read_security(struct scsi_disk *sdkp, unsigned char *buffer)
 		sdkp->security = 1;
 }
 
+static bool sd_validate_min_xfer_size(struct scsi_disk *sdkp)
+{
+	struct scsi_device *sdp = sdkp->device;
+	unsigned int min_xfer_bytes =
+		logical_to_bytes(sdp, sdkp->min_xfer_blocks);
+
+	if (sdkp->min_xfer_blocks == 0)
+		return false;
+
+	if (min_xfer_bytes & (sdkp->physical_block_size - 1)) {
+		sd_first_printk(KERN_WARNING, sdkp,
+				"Preferred minimum I/O size %u bytes not a " \
+				"multiple of physical block size (%u bytes)\n",
+				min_xfer_bytes, sdkp->physical_block_size);
+		sdkp->min_xfer_blocks = 0;
+		return false;
+	}
+
+	sd_first_printk(KERN_INFO, sdkp, "Preferred minimum I/O size %u bytes\n",
+			min_xfer_bytes);
+	return true;
+}
+
 /*
  * Determine the device's preferred I/O size for reads and writes
  * unless the reported value is unreasonably small, large, not a
@@ -3105,6 +3125,8 @@ static bool sd_validate_opt_xfer_size(struct scsi_disk *sdkp,
 	struct scsi_device *sdp = sdkp->device;
 	unsigned int opt_xfer_bytes =
 		logical_to_bytes(sdp, sdkp->opt_xfer_blocks);
+	unsigned int min_xfer_bytes =
+		logical_to_bytes(sdp, sdkp->min_xfer_blocks);
 
 	if (sdkp->opt_xfer_blocks == 0)
 		return false;
@@ -3141,6 +3163,15 @@ static bool sd_validate_opt_xfer_size(struct scsi_disk *sdkp,
 		return false;
 	}
 
+	if (min_xfer_bytes && opt_xfer_bytes & (min_xfer_bytes - 1)) {
+		sd_first_printk(KERN_WARNING, sdkp,
+				"Optimal transfer size %u bytes not a " \
+				"multiple of preferred minimum block " \
+				"size (%u bytes)\n",
+				opt_xfer_bytes, min_xfer_bytes);
+		return false;
+	}
+
 	sd_first_printk(KERN_INFO, sdkp, "Optimal transfer size %u bytes\n",
 			opt_xfer_bytes);
 	return true;
@@ -3224,6 +3255,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
 		dev_max = min_not_zero(dev_max, sdkp->max_xfer_blocks);
 		q->limits.max_dev_sectors = logical_to_sectors(sdp, dev_max);
 
+		if (sd_validate_min_xfer_size(sdkp))
+			blk_queue_io_min(sdkp->disk->queue,
+					 logical_to_bytes(sdp, sdkp->min_xfer_blocks));
+
 		if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
 			q->limits.io_opt = logical_to_bytes(sdp, sdkp->opt_xfer_blocks);
 			rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks);
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index b59136c41..4154030c8 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -91,6 +91,7 @@ struct scsi_disk {
 	atomic_t	openers;
 	sector_t	capacity;	/* size in logical blocks */
 	int		max_retries;
+	u32		min_xfer_blocks;
 	u32		max_xfer_blocks;
 	u32		opt_xfer_blocks;
 	u32		max_ws_blocks;
-- 
2.34.1.75.gabe6bb3905
From bb71d8c5bb825e2074e0b044d06ee2336c1a907b Mon Sep 17 00:00:00 2001
From: Yafang Shao
Date: Thu, 4 Jun 2020 03:05:47 -0400
Subject: [PATCH 05/18] iomap: avoid deadlock if memory reclaim is triggered
 in writepage path

Recently there is a XFS deadlock on our server with an old kernel.
This deadlock is caused by allocating memory in xfs_map_blocks() while
doing writeback on behalf of memory reclaim. Although this deadlock
happens on an old kernel, I think it could happen on upstream as well.
This issue only happens once and can't be reproduced, so I haven't
tried to reproduce it on an upstream kernel.

Below is the call trace of this deadlock.

[480594.790087] INFO: task redis-server:16212 blocked for more than 120 seconds.
[480594.790087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[480594.790088] redis-server D ffffffff8168bd60 0 16212 14347 0x00000004
[480594.790090] ffff880da128f070 0000000000000082 ffff880f94a2eeb0 ffff880da128ffd8
[480594.790092] ffff880da128ffd8 ffff880da128ffd8 ffff880f94a2eeb0 ffff88103f9d6c40
[480594.790094] 0000000000000000 7fffffffffffffff ffff88207ffc0ee8 ffffffff8168bd60
[480594.790096] Call Trace:
[480594.790101] [] schedule+0x29/0x70
[480594.790103] [] schedule_timeout+0x239/0x2c0
[480594.790111] [] io_schedule_timeout+0xae/0x130
[480594.790114] [] io_schedule+0x18/0x20
[480594.790116] [] bit_wait_io+0x11/0x50
[480594.790118] [] __wait_on_bit+0x65/0x90
[480594.790121] [] wait_on_page_bit+0x81/0xa0
[480594.790125] [] shrink_page_list+0x6d2/0xaf0
[480594.790130] [] shrink_inactive_list+0x223/0x710
[480594.790135] [] shrink_lruvec+0x3b5/0x810
[480594.790139] [] shrink_zone+0xba/0x1e0
[480594.790141] [] do_try_to_free_pages+0x100/0x510
[480594.790143] [] try_to_free_mem_cgroup_pages+0xdd/0x170
[480594.790145] [] mem_cgroup_reclaim+0x4e/0x120
[480594.790147] [] __mem_cgroup_try_charge+0x41c/0x670
[480594.790153] [] __memcg_kmem_newpage_charge+0xf6/0x180
[480594.790157] [] __alloc_pages_nodemask+0x22d/0x420
[480594.790162] [] alloc_pages_current+0xaa/0x170
[480594.790165] [] new_slab+0x30c/0x320
[480594.790168] [] ___slab_alloc+0x3ac/0x4f0
[480594.790204] [] __slab_alloc+0x40/0x5c
[480594.790206] [] kmem_cache_alloc+0x193/0x1e0
[480594.790233] [] kmem_zone_alloc+0x97/0x130 [xfs]
[480594.790247] [] _xfs_trans_alloc+0x3a/0xa0 [xfs]
[480594.790261] [] xfs_trans_alloc+0x3c/0x50 [xfs]
[480594.790276] [] xfs_iomap_write_allocate+0x1cb/0x390 [xfs]
[480594.790299] [] xfs_map_blocks+0x1a6/0x210 [xfs]
[480594.790312] [] xfs_do_writepage+0x17b/0x550 [xfs]
[480594.790314] [] write_cache_pages+0x251/0x4d0 [xfs]
[480594.790338] [] xfs_vm_writepages+0xc5/0xe0 [xfs]
[480594.790341] [] do_writepages+0x1e/0x40
[480594.790343] [] __filemap_fdatawrite_range+0x65/0x80
[480594.790346] [] filemap_write_and_wait_range+0x41/0x90
[480594.790360] [] xfs_file_fsync+0x66/0x1e0 [xfs]
[480594.790363] [] do_fsync+0x65/0xa0
[480594.790365] [] SyS_fdatasync+0x13/0x20
[480594.790367] [] system_call_fastpath+0x16/0x1b

Note that xfs_iomap_write_allocate() is replaced by xfs_convert_blocks()
in commit 4ad765edb02a ("xfs: move xfs_iomap_write_allocate to
xfs_aops.c") and write_cache_pages() is replaced by iomap_writepages()
in commit 598ecfbaa742 ("iomap: lift the xfs writeback code to iomap").
So for upstream, the call trace should be:

xfs_vm_writepages
  -> iomap_writepages
    -> write_cache_pages
      -> iomap_do_writepage
        -> xfs_map_blocks
          -> xfs_convert_blocks
            -> xfs_bmapi_convert_delalloc
              -> xfs_trans_alloc //It should alloc page with GFP_NOFS
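The fix scopes writeback-side allocations with the memalloc_nofs API; a
minimal sketch of that pattern (the work function here is a stand-in,
not a real kernel symbol):

	unsigned int nofs_flag;

	nofs_flag = memalloc_nofs_save();  /* mask __GFP_FS for this task */
	ret = do_writeback_work();         /* nested allocations can no longer
	                                      recurse into filesystem reclaim */
	memalloc_nofs_restore(nofs_flag);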
Signed-off-by: Yafang Shao
---
 fs/iomap/buffered-io.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 97119ec3b..b878d225a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include <linux/sched/mm.h>
 #include "trace.h"
 #include "../internal.h"
 
@@ -1395,9 +1396,11 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 {
 	struct iomap_writepage_ctx *wpc = data;
 	struct inode *inode = page->mapping->host;
+	unsigned int nofs_flag;
 	pgoff_t end_index;
 	u64 end_offset;
 	loff_t offset;
+	int ret;
 
 	trace_iomap_writepage(inode, page_offset(page), PAGE_SIZE);
 
@@ -1481,7 +1484,16 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		end_offset = offset;
 	}
 
-	return iomap_writepage_map(wpc, wbc, inode, page, end_offset);
+	/*
+	 * We can allocate memory here while doing writeback on behalf of
+	 * memory reclaim. To avoid memory allocation deadlocks set the
+	 * task-wide nofs context for the following operations.
+	 */
+	nofs_flag = memalloc_nofs_save();
+	ret = iomap_writepage_map(wpc, wbc, inode, page, end_offset);
+	memalloc_nofs_restore(nofs_flag);
+
+	return ret;
 
 redirty:
 	redirty_page_for_writepage(wbc, page);
-- 
2.34.1.75.gabe6bb3905

From a3b073aa89cffda192803a812eebfeecab0cf470 Mon Sep 17 00:00:00 2001
From: Piotr Gorski
Date: Mon, 30 Aug 2021 13:28:51 +0200
Subject: [PATCH 06/18] tty: Allow setting the number of available virtual
 TTYs

Signed-off-by: Piotr Gorski
---
 drivers/tty/Kconfig     | 13 +++++++++++++
 include/uapi/linux/vt.h | 15 ++++++++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/Kconfig b/drivers/tty/Kconfig
index 23cc988c6..d49b71eab 100644
--- a/drivers/tty/Kconfig
+++ b/drivers/tty/Kconfig
@@ -75,6 +75,19 @@ config VT_CONSOLE_SLEEP
 	def_bool y
 	depends on VT_CONSOLE && PM_SLEEP
 
+config NR_TTY_DEVICES
+	int "Maximum tty device number"
+	depends on VT
+	range 12 63
+	default 63
+	help
+	  This option is used to change the number of tty devices in /dev.
+	  The default value is 63. The lowest number you can set is 12,
+	  63 is also the upper limit so we don't overrun the serial
+	  consoles.
+
+	  If unsure, say 63.
+
 config HW_CONSOLE
 	bool
 	depends on VT
diff --git a/include/uapi/linux/vt.h b/include/uapi/linux/vt.h
index e9d39c485..3bceead8d 100644
--- a/include/uapi/linux/vt.h
+++ b/include/uapi/linux/vt.h
@@ -3,12 +3,25 @@
 #define _UAPI_LINUX_VT_H
 
+/*
+ * We will make this definition solely for the purpose of making packages
+ * such as splashutils build, because they can not understand that
+ * NR_TTY_DEVICES is defined in the kernel configuration.
+ */
+#ifndef CONFIG_NR_TTY_DEVICES
+#define CONFIG_NR_TTY_DEVICES 63
+#endif
+
 /*
  * These constants are also useful for user-level apps (e.g., VC
  * resizing).
  */
 
 #define MIN_NR_CONSOLES 1	/* must be at least 1 */
-#define MAX_NR_CONSOLES	63	/* serial lines start at 64 */
+/*
+ * NR_TTY_DEVICES:
+ * Value MUST be at least 12 and must never be higher than 63
+ */
+#define MAX_NR_CONSOLES	CONFIG_NR_TTY_DEVICES	/* serial lines start above this */
 		/* Note: the ioctl VT_GETSTATE does not work for
 		   consoles 16 and higher (since it returns a short) */
-- 
2.34.1.75.gabe6bb3905

From 67531574d96dcda473b33bf9fb895d154e97b069 Mon Sep 17 00:00:00 2001
From: Alexandre Frade
Date: Thu, 7 Oct 2021 14:09:55 +0000
Subject: [PATCH 07/18] i2c: busses: Add SMBus capability to work with OpenRGB
 driver control

Signed-off-by: Alexandre Frade
---
 drivers/i2c/busses/Kconfig       |   9 +
 drivers/i2c/busses/Makefile      |   1 +
 drivers/i2c/busses/i2c-nct6775.c | 647 +++++++++++++++++++++++++++++++
 drivers/i2c/busses/i2c-piix4.c   |   9 +-
 4 files changed, 664 insertions(+), 2 deletions(-)
 create mode 100644 drivers/i2c/busses/i2c-nct6775.c

diff --git a/drivers/i2c/busses/Kconfig b/drivers/i2c/busses/Kconfig
index e17790fe3..d4dd2a1dc 100644
--- a/drivers/i2c/busses/Kconfig
+++ b/drivers/i2c/busses/Kconfig
@@ -219,6 +219,15 @@ config I2C_CHT_WC
 	  combined with a FUSB302 Type-C port-controller as such it is advised
 	  to also select CONFIG_TYPEC_FUSB302=m.
 
+config I2C_NCT6775
+	tristate "Nuvoton NCT6775 and compatible SMBus controller"
+	help
+	  If you say yes to this option, support will be included for the
+	  Nuvoton NCT6775 and compatible SMBus controllers.
+
+	  This driver can also be built as a module. If so, the module
+	  will be called i2c-nct6775.
+
 config I2C_NFORCE2
 	tristate "Nvidia nForce2, nForce3 and nForce4"
 	depends on PCI
diff --git a/drivers/i2c/busses/Makefile b/drivers/i2c/busses/Makefile
index 1336b04f4..ccbe82704 100644
--- a/drivers/i2c/busses/Makefile
+++ b/drivers/i2c/busses/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_I2C_CHT_WC)	+= i2c-cht-wc.o
 obj-$(CONFIG_I2C_I801)		+= i2c-i801.o
 obj-$(CONFIG_I2C_ISCH)		+= i2c-isch.o
 obj-$(CONFIG_I2C_ISMT)		+= i2c-ismt.o
+obj-$(CONFIG_I2C_NCT6775)	+= i2c-nct6775.o
 obj-$(CONFIG_I2C_NFORCE2)	+= i2c-nforce2.o
 obj-$(CONFIG_I2C_NFORCE2_S4985)	+= i2c-nforce2-s4985.o
 obj-$(CONFIG_I2C_NVIDIA_GPU)	+= i2c-nvidia-gpu.o
diff --git a/drivers/i2c/busses/i2c-nct6775.c b/drivers/i2c/busses/i2c-nct6775.c
new file mode 100644
index 000000000..b59f842f7
--- /dev/null
+++ b/drivers/i2c/busses/i2c-nct6775.c
@@ -0,0 +1,647 @@
+/*
+ * i2c-nct6775 - Driver for the SMBus master functionality of
+ *		 Nuvoton NCT677x Super-I/O chips
+ *
+ * Copyright (C) 2019 Adam Honse
+ *
+ * Derived from nct6775 hwmon driver
+ * Copyright (C) 2012 Guenter Roeck
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRVNAME "i2c-nct6775" + +/* Nuvoton SMBus address offsets */ +#define SMBHSTDAT (0 + nuvoton_nct6793d_smba) +#define SMBBLKSZ (1 + nuvoton_nct6793d_smba) +#define SMBHSTCMD (2 + nuvoton_nct6793d_smba) +#define SMBHSTIDX (3 + nuvoton_nct6793d_smba) //Index field is the Command field on other controllers +#define SMBHSTCTL (4 + nuvoton_nct6793d_smba) +#define SMBHSTADD (5 + nuvoton_nct6793d_smba) +#define SMBHSTERR (9 + nuvoton_nct6793d_smba) +#define SMBHSTSTS (0xE + nuvoton_nct6793d_smba) + +/* Command register */ +#define NCT6793D_READ_BYTE 0 +#define NCT6793D_READ_WORD 1 +#define NCT6793D_READ_BLOCK 2 +#define NCT6793D_BLOCK_WRITE_READ_PROC_CALL 3 +#define NCT6793D_PROC_CALL 4 +#define NCT6793D_WRITE_BYTE 8 +#define NCT6793D_WRITE_WORD 9 +#define NCT6793D_WRITE_BLOCK 10 + +/* Control register */ +#define NCT6793D_MANUAL_START 128 +#define NCT6793D_SOFT_RESET 64 + +/* Error register */ +#define NCT6793D_NO_ACK 32 + +/* Status register */ +#define NCT6793D_FIFO_EMPTY 1 +#define NCT6793D_FIFO_FULL 2 +#define NCT6793D_MANUAL_ACTIVE 4 + +#define NCT6775_LD_SMBUS 0x0B + +/* Other settings */ +#define MAX_RETRIES 400 + +enum kinds { nct6106, nct6775, nct6776, nct6779, nct6791, nct6792, nct6793, + nct6795, nct6796, nct6798 }; + +struct nct6775_sio_data { + int sioreg; + enum kinds kind; +}; + +/* used to set data->name = nct6775_device_names[data->sio_kind] */ +static const char * const nct6775_device_names[] = { + "nct6106", + "nct6775", + "nct6776", + "nct6779", + "nct6791", + "nct6792", + "nct6793", + "nct6795", + "nct6796", + "nct6798", +}; + +static const char * const nct6775_sio_names[] __initconst = { + "NCT6106D", + "NCT6775F", + "NCT6776D/F", + "NCT6779D", + "NCT6791D", + "NCT6792D", + "NCT6793D", + "NCT6795D", + "NCT6796D", + "NCT6798D", +}; + +#define SIO_REG_LDSEL 0x07 /* Logical device select */ +#define SIO_REG_DEVID 0x20 /* Device ID (2 bytes) */ +#define SIO_REG_SMBA 0x62 /* SMBus base address register */ + +#define SIO_NCT6106_ID 0xc450 +#define SIO_NCT6775_ID 0xb470 +#define SIO_NCT6776_ID 0xc330 +#define SIO_NCT6779_ID 0xc560 +#define SIO_NCT6791_ID 0xc800 +#define SIO_NCT6792_ID 0xc910 +#define SIO_NCT6793_ID 0xd120 +#define SIO_NCT6795_ID 0xd350 +#define SIO_NCT6796_ID 0xd420 +#define SIO_NCT6798_ID 0xd428 +#define SIO_ID_MASK 0xFFF0 + +static inline void +superio_outb(int ioreg, int reg, int val) +{ + outb(reg, ioreg); + outb(val, ioreg + 1); +} + +static inline int +superio_inb(int ioreg, int reg) +{ + outb(reg, ioreg); + return inb(ioreg + 1); +} + +static inline void +superio_select(int ioreg, int ld) +{ + outb(SIO_REG_LDSEL, ioreg); + outb(ld, ioreg + 1); +} + +static inline int +superio_enter(int ioreg) +{ + /* + * Try to reserve and for exclusive access. 
+ */ + if (!request_muxed_region(ioreg, 2, DRVNAME)) + return -EBUSY; + + outb(0x87, ioreg); + outb(0x87, ioreg); + + return 0; +} + +static inline void +superio_exit(int ioreg) +{ + outb(0xaa, ioreg); + outb(0x02, ioreg); + outb(0x02, ioreg + 1); + release_region(ioreg, 2); +} + +/* + * ISA constants + */ + +#define IOREGION_ALIGNMENT (~7) +#define IOREGION_LENGTH 2 +#define ADDR_REG_OFFSET 0 +#define DATA_REG_OFFSET 1 + +#define NCT6775_REG_BANK 0x4E +#define NCT6775_REG_CONFIG 0x40 + +static struct i2c_adapter *nct6775_adapter; + +struct i2c_nct6775_adapdata { + unsigned short smba; +}; + +/* Return negative errno on error. */ +static s32 nct6775_access(struct i2c_adapter * adap, u16 addr, + unsigned short flags, char read_write, + u8 command, int size, union i2c_smbus_data * data) +{ + struct i2c_nct6775_adapdata *adapdata = i2c_get_adapdata(adap); + unsigned short nuvoton_nct6793d_smba = adapdata->smba; + int i, len, cnt; + union i2c_smbus_data tmp_data; + int timeout = 0; + + tmp_data.word = 0; + cnt = 0; + len = 0; + + outb_p(NCT6793D_SOFT_RESET, SMBHSTCTL); + + switch (size) { + case I2C_SMBUS_QUICK: + outb_p((addr << 1) | read_write, + SMBHSTADD); + break; + case I2C_SMBUS_BYTE_DATA: + tmp_data.byte = data->byte; + case I2C_SMBUS_BYTE: + outb_p((addr << 1) | read_write, + SMBHSTADD); + outb_p(command, SMBHSTIDX); + if (read_write == I2C_SMBUS_WRITE) { + outb_p(tmp_data.byte, SMBHSTDAT); + outb_p(NCT6793D_WRITE_BYTE, SMBHSTCMD); + } + else { + outb_p(NCT6793D_READ_BYTE, SMBHSTCMD); + } + break; + case I2C_SMBUS_WORD_DATA: + outb_p((addr << 1) | read_write, + SMBHSTADD); + outb_p(command, SMBHSTIDX); + if (read_write == I2C_SMBUS_WRITE) { + outb_p(data->word & 0xff, SMBHSTDAT); + outb_p((data->word & 0xff00) >> 8, SMBHSTDAT); + outb_p(NCT6793D_WRITE_WORD, SMBHSTCMD); + } + else { + outb_p(NCT6793D_READ_WORD, SMBHSTCMD); + } + break; + case I2C_SMBUS_BLOCK_DATA: + outb_p((addr << 1) | read_write, + SMBHSTADD); + outb_p(command, SMBHSTIDX); + if (read_write == I2C_SMBUS_WRITE) { + len = data->block[0]; + if (len == 0 || len > I2C_SMBUS_BLOCK_MAX) + return -EINVAL; + outb_p(len, SMBBLKSZ); + + cnt = 1; + if (len >= 4) { + for (i = cnt; i <= 4; i++) { + outb_p(data->block[i], SMBHSTDAT); + } + + len -= 4; + cnt += 4; + } + else { + for (i = cnt; i <= len; i++ ) { + outb_p(data->block[i], SMBHSTDAT); + } + + len = 0; + } + + outb_p(NCT6793D_WRITE_BLOCK, SMBHSTCMD); + } + else { + return -ENOTSUPP; + } + break; + default: + dev_warn(&adap->dev, "Unsupported transaction %d\n", size); + return -EOPNOTSUPP; + } + + outb_p(NCT6793D_MANUAL_START, SMBHSTCTL); + + while ((size == I2C_SMBUS_BLOCK_DATA) && (len > 0)) { + if (read_write == I2C_SMBUS_WRITE) { + timeout = 0; + while ((inb_p(SMBHSTSTS) & NCT6793D_FIFO_EMPTY) == 0) + { + if(timeout > MAX_RETRIES) + { + return -ETIMEDOUT; + } + usleep_range(250, 500); + timeout++; + } + + //Load more bytes into FIFO + if (len >= 4) { + for (i = cnt; i <= (cnt + 4); i++) { + outb_p(data->block[i], SMBHSTDAT); + } + + len -= 4; + cnt += 4; + } + else { + for (i = cnt; i <= (cnt + len); i++) { + outb_p(data->block[i], SMBHSTDAT); + } + + len = 0; + } + } + else { + return -ENOTSUPP; + } + + } + + //wait for manual mode to complete + timeout = 0; + while ((inb_p(SMBHSTSTS) & NCT6793D_MANUAL_ACTIVE) != 0) + { + if(timeout > MAX_RETRIES) + { + return -ETIMEDOUT; + } + usleep_range(250, 500); + timeout++; + } + + if ((inb_p(SMBHSTERR) & NCT6793D_NO_ACK) != 0) { + return -ENXIO; + } + else if ((read_write == I2C_SMBUS_WRITE) || (size == I2C_SMBUS_QUICK)) { + 
return 0; + } + + switch (size) { + case I2C_SMBUS_QUICK: + case I2C_SMBUS_BYTE_DATA: + data->byte = inb_p(SMBHSTDAT); + break; + case I2C_SMBUS_WORD_DATA: + data->word = inb_p(SMBHSTDAT) + (inb_p(SMBHSTDAT) << 8); + break; + } + return 0; +} + +static u32 nct6775_func(struct i2c_adapter *adapter) +{ + return I2C_FUNC_SMBUS_QUICK | I2C_FUNC_SMBUS_BYTE | + I2C_FUNC_SMBUS_BYTE_DATA | I2C_FUNC_SMBUS_WORD_DATA | + I2C_FUNC_SMBUS_BLOCK_DATA; +} + +static const struct i2c_algorithm smbus_algorithm = { + .smbus_xfer = nct6775_access, + .functionality = nct6775_func, +}; + +static int nct6775_add_adapter(unsigned short smba, const char *name, struct i2c_adapter **padap) +{ + struct i2c_adapter *adap; + struct i2c_nct6775_adapdata *adapdata; + int retval; + + adap = kzalloc(sizeof(*adap), GFP_KERNEL); + if (adap == NULL) { + return -ENOMEM; + } + + adap->owner = THIS_MODULE; + adap->class = I2C_CLASS_HWMON | I2C_CLASS_SPD; + adap->algo = &smbus_algorithm; + + adapdata = kzalloc(sizeof(*adapdata), GFP_KERNEL); + if (adapdata == NULL) { + kfree(adap); + return -ENOMEM; + } + + adapdata->smba = smba; + + snprintf(adap->name, sizeof(adap->name), + "SMBus NCT67xx adapter%s at %04x", name, smba); + + i2c_set_adapdata(adap, adapdata); + + retval = i2c_add_adapter(adap); + if (retval) { + kfree(adapdata); + kfree(adap); + return retval; + } + + *padap = adap; + return 0; +} + +static void nct6775_remove_adapter(struct i2c_adapter *adap) +{ + struct i2c_nct6775_adapdata *adapdata = i2c_get_adapdata(adap); + + if (adapdata->smba) { + i2c_del_adapter(adap); + kfree(adapdata); + kfree(adap); + } +} + +//static SIMPLE_DEV_PM_OPS(nct6775_dev_pm_ops, nct6775_suspend, nct6775_resume); + +/* + * when Super-I/O functions move to a separate file, the Super-I/O + * bus will manage the lifetime of the device and this module will only keep + * track of the nct6775 driver. 
But since we use platform_device_alloc(), we + * must keep track of the device + */ +static struct platform_device *pdev[2]; + +static int nct6775_probe(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct nct6775_sio_data *sio_data = dev_get_platdata(dev); + struct resource *res; + + res = platform_get_resource(pdev, IORESOURCE_IO, 0); + if (!devm_request_region(&pdev->dev, res->start, IOREGION_LENGTH, + DRVNAME)) + return -EBUSY; + + switch (sio_data->kind) { + case nct6791: + case nct6792: + case nct6793: + case nct6795: + case nct6796: + case nct6798: + nct6775_add_adapter(res->start, "", &nct6775_adapter); + break; + default: + return -ENODEV; + } + + return 0; +} +/* +static void nct6791_enable_io_mapping(int sioaddr) +{ + int val; + + val = superio_inb(sioaddr, NCT6791_REG_HM_IO_SPACE_LOCK_ENABLE); + if (val & 0x10) { + pr_info("Enabling hardware monitor logical device mappings.\n"); + superio_outb(sioaddr, NCT6791_REG_HM_IO_SPACE_LOCK_ENABLE, + val & ~0x10); + } +}*/ + +static struct platform_driver i2c_nct6775_driver = { + .driver = { + .name = DRVNAME, +// .pm = &nct6775_dev_pm_ops, + }, + .probe = nct6775_probe, +}; + +static void __exit i2c_nct6775_exit(void) +{ + int i; + + if(nct6775_adapter) + nct6775_remove_adapter(nct6775_adapter); + + for (i = 0; i < ARRAY_SIZE(pdev); i++) { + if (pdev[i]) + platform_device_unregister(pdev[i]); + } + platform_driver_unregister(&i2c_nct6775_driver); +} + +/* nct6775_find() looks for a '627 in the Super-I/O config space */ +static int __init nct6775_find(int sioaddr, struct nct6775_sio_data *sio_data) +{ + u16 val; + int err; + int addr; + + err = superio_enter(sioaddr); + if (err) + return err; + + val = (superio_inb(sioaddr, SIO_REG_DEVID) << 8) | + superio_inb(sioaddr, SIO_REG_DEVID + 1); + + switch (val & SIO_ID_MASK) { + case SIO_NCT6106_ID: + sio_data->kind = nct6106; + break; + case SIO_NCT6775_ID: + sio_data->kind = nct6775; + break; + case SIO_NCT6776_ID: + sio_data->kind = nct6776; + break; + case SIO_NCT6779_ID: + sio_data->kind = nct6779; + break; + case SIO_NCT6791_ID: + sio_data->kind = nct6791; + break; + case SIO_NCT6792_ID: + sio_data->kind = nct6792; + break; + case SIO_NCT6793_ID: + sio_data->kind = nct6793; + break; + case SIO_NCT6795_ID: + sio_data->kind = nct6795; + break; + case SIO_NCT6796_ID: + sio_data->kind = nct6796; + break; + case SIO_NCT6798_ID: + sio_data->kind = nct6798; + break; + default: + if (val != 0xffff) + pr_debug("unsupported chip ID: 0x%04x\n", val); + superio_exit(sioaddr); + return -ENODEV; + } + + /* We have a known chip, find the SMBus I/O address */ + superio_select(sioaddr, NCT6775_LD_SMBUS); + val = (superio_inb(sioaddr, SIO_REG_SMBA) << 8) + | superio_inb(sioaddr, SIO_REG_SMBA + 1); + addr = val & IOREGION_ALIGNMENT; + if (addr == 0) { + pr_err("Refusing to enable a Super-I/O device with a base I/O port 0\n"); + superio_exit(sioaddr); + return -ENODEV; + } + + //if (sio_data->kind == nct6791 || sio_data->kind == nct6792 || + // sio_data->kind == nct6793 || sio_data->kind == nct6795 || + // sio_data->kind == nct6796) + // nct6791_enable_io_mapping(sioaddr); + + superio_exit(sioaddr); + pr_info("Found %s or compatible chip at %#x:%#x\n", + nct6775_sio_names[sio_data->kind], sioaddr, addr); + sio_data->sioreg = sioaddr; + + return addr; +} + +static int __init i2c_nct6775_init(void) +{ + int i, err; + bool found = false; + int address; + struct resource res; + struct nct6775_sio_data sio_data; + int sioaddr[2] = { 0x2e, 0x4e }; + + err = 
platform_driver_register(&i2c_nct6775_driver); + if (err) + return err; + + /* + * initialize sio_data->kind and sio_data->sioreg. + * + * when Super-I/O functions move to a separate file, the Super-I/O + * driver will probe 0x2e and 0x4e and auto-detect the presence of a + * nct6775 hardware monitor, and call probe() + */ + for (i = 0; i < ARRAY_SIZE(pdev); i++) { + address = nct6775_find(sioaddr[i], &sio_data); + if (address <= 0) + continue; + + found = true; + + pdev[i] = platform_device_alloc(DRVNAME, address); + if (!pdev[i]) { + err = -ENOMEM; + goto exit_device_unregister; + } + + err = platform_device_add_data(pdev[i], &sio_data, + sizeof(struct nct6775_sio_data)); + if (err) + goto exit_device_put; + + memset(&res, 0, sizeof(res)); + res.name = DRVNAME; + res.start = address; + res.end = address + IOREGION_LENGTH - 1; + res.flags = IORESOURCE_IO; + + err = acpi_check_resource_conflict(&res); + if (err) { + platform_device_put(pdev[i]); + pdev[i] = NULL; + continue; + } + + err = platform_device_add_resources(pdev[i], &res, 1); + if (err) + goto exit_device_put; + + /* platform_device_add calls probe() */ + err = platform_device_add(pdev[i]); + if (err) + goto exit_device_put; + } + if (!found) { + err = -ENODEV; + goto exit_unregister; + } + + return 0; + +exit_device_put: + platform_device_put(pdev[i]); +exit_device_unregister: + while (--i >= 0) { + if (pdev[i]) + platform_device_unregister(pdev[i]); + } +exit_unregister: + platform_driver_unregister(&i2c_nct6775_driver); + return err; +} + +MODULE_AUTHOR("Adam Honse "); +MODULE_DESCRIPTION("SMBus driver for NCT6775F and compatible chips"); +MODULE_LICENSE("GPL"); + +module_init(i2c_nct6775_init); +module_exit(i2c_nct6775_exit); diff --git a/drivers/i2c/busses/i2c-piix4.c b/drivers/i2c/busses/i2c-piix4.c index 8c1b31ed0..9131e4004 100644 --- a/drivers/i2c/busses/i2c-piix4.c +++ b/drivers/i2c/busses/i2c-piix4.c @@ -467,11 +467,11 @@ static int piix4_transaction(struct i2c_adapter *piix4_adapter) if (srvrworks_csb5_delay) /* Extra delay for SERVERWORKS_CSB5 */ usleep_range(2000, 2100); else - usleep_range(250, 500); + usleep_range(25, 50); while ((++timeout < MAX_TIMEOUT) && ((temp = inb_p(SMBHSTSTS)) & 0x01)) - usleep_range(250, 500); + usleep_range(25, 50); /* If the SMBus is still busy, we give up */ if (timeout == MAX_TIMEOUT) { @@ -982,6 +982,11 @@ static int piix4_probe(struct pci_dev *dev, const struct pci_device_id *id) retval = piix4_setup_sb800(dev, id, 1); } + if (dev->vendor == PCI_VENDOR_ID_AMD && + dev->device == PCI_DEVICE_ID_AMD_KERNCZ_SMBUS) { + retval = piix4_setup_sb800(dev, id, 1); + } + if (retval > 0) { /* Try to add the aux adapter if it exists, * piix4_add_adapter will clean up if this fails */ -- 2.34.1.75.gabe6bb3905 From ae79d2c0669b5233d16468af9d5151b60e9518c0 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Tue, 12 Oct 2021 12:13:52 -0600 Subject: [PATCH 08/18] nvme: don't memset() the normal read/write command This memset in the fast path costs a lot of cycles on my setup. Here's a top-of-profile of doing ~6.7M IOPS: + 5.90% io_uring [nvme] [k] nvme_queue_rq + 5.32% io_uring [nvme_core] [k] nvme_setup_cmd + 5.17% io_uring [kernel.vmlinux] [k] io_submit_sqes + 4.97% io_uring [kernel.vmlinux] [k] blkdev_direct_IO and a perf diff with this patch: 0.92% +4.40% [nvme_core] [k] nvme_setup_cmd reducing it from 5.3% to only 0.9%. This takes it from the 2nd most cycle consumer to something that's mostly irrelevant. 
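In outline, the change for reads and writes (condensed from the hunk
below, not new code) drops the memset() and zeroes only the rw fields
that nvme_setup_rw() does not subsequently fill:

	cmnd->rw.flags = 0;
	cmnd->rw.command_id = 0;
	cmnd->rw.rsvd2 = 0;
	cmnd->rw.metadata = 0;
	cmnd->rw.reftag = 0;
	cmnd->rw.apptag = 0;
	cmnd->rw.appmask = 0;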
Retain the full clear for the other commands to avoid doing any audits there, and just clear the fields in the rw command manually that we don't already fill. Signed-off-by: Jens Axboe --- drivers/nvme/host/core.c | 30 +++++++++++++++++++++++++----- 1 file changed, 25 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index f8dd664b2..2d609ee57 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -922,9 +922,16 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns, dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH; cmnd->rw.opcode = op; + cmnd->rw.flags = 0; + cmnd->rw.command_id = 0; cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id); + cmnd->rw.rsvd2 = 0; + cmnd->rw.metadata = 0; cmnd->rw.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req))); cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1); + cmnd->rw.reftag = 0; + cmnd->rw.apptag = 0; + cmnd->rw.appmask = 0; if (req_op(req) == REQ_OP_WRITE && ctrl->nr_streams) nvme_assign_write_stream(ctrl, req, &control, &dsmgmt); @@ -975,51 +982,64 @@ void nvme_cleanup_cmd(struct request *req) } EXPORT_SYMBOL_GPL(nvme_cleanup_cmd); +static void nvme_clear_cmd(struct request *req) +{ + if (!(req->rq_flags & RQF_DONTPREP)) { + nvme_clear_nvme_request(req); + memset(nvme_req(req)->cmd, 0, sizeof(struct nvme_command)); + } +} + blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req) { struct nvme_command *cmd = nvme_req(req)->cmd; struct nvme_ctrl *ctrl = nvme_req(req)->ctrl; blk_status_t ret = BLK_STS_OK; - if (!(req->rq_flags & RQF_DONTPREP)) { - nvme_clear_nvme_request(req); - memset(cmd, 0, sizeof(*cmd)); - } - switch (req_op(req)) { case REQ_OP_DRV_IN: case REQ_OP_DRV_OUT: /* these are setup prior to execution in nvme_init_request() */ break; case REQ_OP_FLUSH: + nvme_clear_cmd(req); nvme_setup_flush(ns, cmd); break; case REQ_OP_ZONE_RESET_ALL: case REQ_OP_ZONE_RESET: + nvme_clear_cmd(req); ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_RESET); break; case REQ_OP_ZONE_OPEN: + nvme_clear_cmd(req); ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OPEN); break; case REQ_OP_ZONE_CLOSE: + nvme_clear_cmd(req); ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_CLOSE); break; case REQ_OP_ZONE_FINISH: + nvme_clear_cmd(req); ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH); break; case REQ_OP_WRITE_ZEROES: + nvme_clear_cmd(req); ret = nvme_setup_write_zeroes(ns, req, cmd); break; case REQ_OP_DISCARD: + nvme_clear_cmd(req); ret = nvme_setup_discard(ns, req, cmd); break; case REQ_OP_READ: + nvme_clear_nvme_request(req); ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read); break; case REQ_OP_WRITE: + nvme_clear_nvme_request(req); ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write); break; case REQ_OP_ZONE_APPEND: + nvme_clear_nvme_request(req); ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append); break; default: -- 2.34.1.75.gabe6bb3905 From 05b8cf2f9a5470f73bf00221edbfec3ac3278a49 Mon Sep 17 00:00:00 2001 From: Sultan Alsawaf Date: Thu, 20 Feb 2020 20:30:52 -0800 Subject: [PATCH 09/18] mm: Stop kswapd early when nothing's waiting for it to free pages Keeping kswapd running when all the failed allocations that invoked it are satisfied incurs a high overhead due to unnecessary page eviction and writeback, as well as spurious VM pressure events to various registered shrinkers. When kswapd doesn't need to work to make an allocation succeed anymore, stop it prematurely to save resources. 
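In outline, the mechanism this patch adds (condensed from the hunks
below; the surrounding slow-path logic is elided):

	/* page allocator slow path */
	atomic_inc(&pgdat->kswapd_waiters);	/* a failed allocation now waits */
	wake_all_kswapds(order, gfp_mask, ac);
	...
	atomic_dec(&pgdat->kswapd_waiters);	/* satisfied, or gave up */

	/* kswapd's balance_pgdat() loop */
	if (ret || kthread_should_stop() ||
	    !atomic_read(&pgdat->kswapd_waiters))
		break;	/* nothing is waiting on kswapd; stop reclaim early */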
Signed-off-by: Sultan Alsawaf --- include/linux/mmzone.h | 1 + mm/page_alloc.c | 17 ++++++++++++++--- mm/vmscan.c | 3 ++- 3 files changed, 17 insertions(+), 4 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6a1d79d84..85be97c56 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -839,6 +839,7 @@ typedef struct pglist_data { unsigned long node_spanned_pages; /* total size of physical page range, including holes */ int node_id; + atomic_t kswapd_waiters; wait_queue_head_t kswapd_wait; wait_queue_head_t pfmemalloc_wait; struct task_struct *kswapd; /* Protected by diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 23d3339ac..1756ecd05 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4883,6 +4883,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, int no_progress_loops; unsigned int cpuset_mems_cookie; int reserve_flags; + pg_data_t *pgdat = ac->preferred_zoneref->zone->zone_pgdat; + bool woke_kswapd = false; /* * We also sanity check to catch abuse of atomic reserves being used by @@ -4916,8 +4918,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (!ac->preferred_zoneref->zone) goto nopage; - if (alloc_flags & ALLOC_KSWAPD) + if (alloc_flags & ALLOC_KSWAPD) { + if (!woke_kswapd) { + atomic_inc(&pgdat->kswapd_waiters); + woke_kswapd = true; + } wake_all_kswapds(order, gfp_mask, ac); + } /* * The adjusted alloc_flags might result in immediate success, so try @@ -5122,9 +5129,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, goto retry; } fail: - warn_alloc(gfp_mask, ac->nodemask, - "page allocation failure: order:%u", order); got_pg: + if (woke_kswapd) + atomic_dec(&pgdat->kswapd_waiters); + if (!page) + warn_alloc(gfp_mask, ac->nodemask, + "page allocation failure: order:%u", order); return page; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 74296c2d1..e513736b2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4082,7 +4082,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) __fs_reclaim_release(_THIS_IP_); ret = try_to_freeze(); __fs_reclaim_acquire(_THIS_IP_); - if (ret || kthread_should_stop()) + if (ret || kthread_should_stop() || + !atomic_read(&pgdat->kswapd_waiters)) break; /* -- 2.34.1.75.gabe6bb3905 From 1b5ceb0aa36f7658d350ef9a9aadb77090d198a5 Mon Sep 17 00:00:00 2001 From: Sultan Alsawaf Date: Thu, 9 Apr 2020 00:20:25 -0700 Subject: [PATCH 10/18] mm: Fully disable watermark boosting when it isn't used The watermark boosting code still wakes kswapd even when there's no watermark boost in effect. Change it to only wake kswapd when there is actually a watermark boost. Signed-off-by: Sultan Alsawaf --- mm/page_alloc.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1756ecd05..6f61c3f2f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2675,8 +2675,11 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, * likelihood of future fallbacks. Wake kswapd now as the node * may be balanced overall and kswapd will not wake naturally. 
*/ - if (boost_watermark(zone) && (alloc_flags & ALLOC_KSWAPD)) - set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags); + if (alloc_flags & ALLOC_KSWAPD) { + boost_watermark(zone); + if (zone->watermark_boost) + set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags); + } /* We are not allowed to try stealing from the whole block */ if (!whole_block) -- 2.34.1.75.gabe6bb3905 From be23ee3cb22d1b3365651b57c032509da5953aaa Mon Sep 17 00:00:00 2001 From: Sultan Alsawaf Date: Wed, 20 May 2020 09:55:17 -0700 Subject: [PATCH 11/18] mm: Don't stop kswapd on a per-node basis when there are no waiters The page allocator wakes all kswapds in an allocation context's allowed nodemask in the slow path, so it doesn't make sense to have the kswapd- waiter count per each NUMA node. Instead, it should be a global counter to stop all kswapds when there are no failed allocation requests. Signed-off-by: Sultan Alsawaf --- include/linux/mmzone.h | 1 - mm/internal.h | 1 + mm/page_alloc.c | 8 ++++---- mm/vmscan.c | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 85be97c56..6a1d79d84 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -839,7 +839,6 @@ typedef struct pglist_data { unsigned long node_spanned_pages; /* total size of physical page range, including holes */ int node_id; - atomic_t kswapd_waiters; wait_queue_head_t kswapd_wait; wait_queue_head_t pfmemalloc_wait; struct task_struct *kswapd; /* Protected by diff --git a/mm/internal.h b/mm/internal.h index cf3cb933e..77b3bce68 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -202,6 +202,7 @@ extern void prep_compound_page(struct page *page, unsigned int order); extern void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags); extern int user_min_free_kbytes; +extern atomic_long_t kswapd_waiters; extern void free_unref_page(struct page *page, unsigned int order); extern void free_unref_page_list(struct list_head *list); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6f61c3f2f..4fb88150d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -118,6 +118,8 @@ typedef int __bitwise fpi_t; */ #define FPI_SKIP_KASAN_POISON ((__force fpi_t)BIT(2)) +atomic_long_t kswapd_waiters = ATOMIC_LONG_INIT(0); + /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ static DEFINE_MUTEX(pcp_batch_high_lock); #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8) @@ -4886,7 +4888,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, int no_progress_loops; unsigned int cpuset_mems_cookie; int reserve_flags; - pg_data_t *pgdat = ac->preferred_zoneref->zone->zone_pgdat; bool woke_kswapd = false; /* @@ -4923,7 +4924,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (alloc_flags & ALLOC_KSWAPD) { if (!woke_kswapd) { - atomic_inc(&pgdat->kswapd_waiters); + atomic_long_inc(&kswapd_waiters); woke_kswapd = true; } wake_all_kswapds(order, gfp_mask, ac); @@ -5134,7 +5135,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, fail: got_pg: if (woke_kswapd) - atomic_dec(&pgdat->kswapd_waiters); + atomic_long_dec(&kswapd_waiters); if (!page) warn_alloc(gfp_mask, ac->nodemask, "page allocation failure: order:%u", order); diff --git a/mm/vmscan.c b/mm/vmscan.c index e513736b2..0d7a5a61b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4083,7 +4083,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) ret = try_to_freeze(); __fs_reclaim_acquire(_THIS_IP_); if (ret || kthread_should_stop() || - 
!atomic_read(&pgdat->kswapd_waiters)) + !atomic_long_read(&kswapd_waiters)) break; /* -- 2.34.1.75.gabe6bb3905 From 456ce8ec00dfa7a483bbfb63e57aa447e77f6397 Mon Sep 17 00:00:00 2001 From: Sultan Alsawaf Date: Sat, 28 Mar 2020 13:06:28 -0700 Subject: [PATCH 12/18] mm: Disable watermark boosting by default What watermark boosting does is preemptively fire up kswapd to free memory when there hasn't been an allocation failure. It does this by increasing kswapd's high watermark goal and then firing up kswapd. The reason why this causes freezes is because, with the increased high watermark goal, kswapd will steal memory from processes that need it in order to make forward progress. These processes will, in turn, try to allocate memory again, which will cause kswapd to steal necessary pages from those processes again, in a positive feedback loop known as page thrashing. When page thrashing occurs, your system is essentially livelocked until the necessary forward progress can be made to stop processes from trying to continuously allocate memory and trigger kswapd to steal it back. This problem already occurs with kswapd *without* watermark boosting, but it's usually only encountered on machines with a small amount of memory and/or a slow CPU. Watermark boosting just makes the existing problem worse enough to notice on higher spec'd machines. Disable watermark boosting by default since it's a total dumpster fire. I can't imagine why anyone would want to explicitly enable it, but the option is there in case someone does. Signed-off-by: Sultan Alsawaf --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4fb88150d..968a02c6b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -340,7 +340,7 @@ compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = { int min_free_kbytes = 1024; int user_min_free_kbytes = -1; -int watermark_boost_factor __read_mostly = 15000; +int watermark_boost_factor __read_mostly; int watermark_scale_factor = 10; static unsigned long nr_kernel_pages __initdata; -- 2.34.1.75.gabe6bb3905 From 392f4665c32ded6585c2d939e020c452b225f61a Mon Sep 17 00:00:00 2001 From: Sultan Alsawaf Date: Sun, 8 Mar 2020 00:31:35 -0800 Subject: [PATCH 13/18] Disable stack conservation for GCC There's plenty of room on the stack for a few more inlined bytes here and there. The measured stack usage at runtime is still safe without this, and performance is surely improved at a microscopic level, so remove it. Signed-off-by: Sultan Alsawaf --- Makefile | 5 ----- 1 file changed, 5 deletions(-) diff --git a/Makefile b/Makefile index e6d2ea920..47fef3b28 100644 --- a/Makefile +++ b/Makefile @@ -1014,11 +1014,6 @@ KBUILD_CFLAGS += -fno-strict-overflow # Make sure -fstack-check isn't enabled (like gentoo apparently did) KBUILD_CFLAGS += -fno-stack-check -# conserve stack if available -ifdef CONFIG_CC_IS_GCC -KBUILD_CFLAGS += -fconserve-stack -endif - # Prohibit date/time macros, which would make the build non-deterministic KBUILD_CFLAGS += -Werror=date-time -- 2.34.1.75.gabe6bb3905 From b4bb8ba925aad689b4dc37a4c212610b9d98fe25 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 28 Oct 2021 13:22:05 +1100 Subject: [PATCH 14/18] vfs: keep inodes with page cache off the inode shrinker LRU Historically (pre-2.5), the inode shrinker used to reclaim only empty inodes and skip over those that still contained page cache. 
This caused problems on highmem hosts: struct inode could fill lowmem
zones before the cache was getting reclaimed in the highmem zones.

To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem. However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because
the inodes are not currently open or dirty - think working with a
large git tree. It further doesn't respect cgroup memory protection
settings and can cause priority inversions between containers.

Nowadays, the page cache also holds non-resident info for evicted
cache pages in order to detect refaults. We've come to rely heavily on
this data inside reclaim for protecting the cache workingset and
driving swap behavior. We also use it to quantify and report workload
health through psi. The latter in turn is used for fleet health
monitoring, as well as driving automated memory sizing of workloads
and containers, proactive reclaim and memory offloading schemes.

The consequence of dropping page cache prematurely is that we're
seeing subtle and not-so-subtle failures in all of the above-mentioned
scenarios, with the workload generally entering unexpected thrashing
states while losing the ability to reliably detect it.

To fix this on non-highmem systems at least, going back to rotating
inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
("mm: don't reclaim inodes with many attached pages")) and failed
(commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
attached pages"")). The issue is mostly that shrinker pools attract
pressure based on their size, and when objects get skipped the
shrinkers remember this as deferred reclaim work. This accumulates
excessive pressure on the remaining inodes, and we can quickly eat
into heavily used ones, or dirty ones that require IO to reclaim, when
there potentially is plenty of cold, clean cache around still.

Instead, this patch keeps populated inodes off the inode LRU in the
first place - just like an open file or dirty state would. An
otherwise clean and unused inode then gets queued when the last cache
entry disappears. This solves the problem without reintroducing the
reclaim issues, and generally is a bit more scalable than having to
wade through potentially hundreds of thousands of busy inodes.

Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the
irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
serialized through i_lock, taken before the i_pages lock, to make sure
depopulated inodes are queued reliably. Additions may race with
deletions, but we'll check again in the shrinker. If additions race
with the shrinker itself, we're protected by the i_lock: if
find_inode() or iput() win, the shrinker will bail on the elevated
i_count or I_REFERENCED; if the shrinker wins and goes ahead with the
inode, it will set I_FREEING and inhibit further igets(), which will
cause the other side to create a new instance of the inode instead.
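The deletion-side locking pattern, condensed from the filemap.c hunk
below for reference:

	spin_lock(&mapping->host->i_lock);
	xa_lock_irq(&mapping->i_pages);
	__delete_from_page_cache(page, NULL);
	xa_unlock_irq(&mapping->i_pages);
	if (mapping_shrinkable(mapping))
		inode_add_lru(mapping->host);	/* queue now-empty inode for reclaim */
	spin_unlock(&mapping->host->i_lock);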
Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Cc: Roman Gushchin Cc: Tejun Heo Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Stephen Rothwell --- fs/inode.c | 46 +++++++++++++++++++++---------------- fs/internal.h | 1 - include/linux/fs.h | 1 + include/linux/pagemap.h | 50 +++++++++++++++++++++++++++++++++++++++++ mm/filemap.c | 8 +++++++ mm/truncate.c | 19 ++++++++++++++-- mm/vmscan.c | 7 ++++++ mm/workingset.c | 10 +++++++++ 8 files changed, 120 insertions(+), 22 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 9abc88d79..3eba0940f 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -428,11 +428,20 @@ void ihold(struct inode *inode) } EXPORT_SYMBOL(ihold); -static void inode_lru_list_add(struct inode *inode) +static void __inode_add_lru(struct inode *inode, bool rotate) { + if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE)) + return; + if (atomic_read(&inode->i_count)) + return; + if (!(inode->i_sb->s_flags & SB_ACTIVE)) + return; + if (!mapping_shrinkable(&inode->i_data)) + return; + if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru)) this_cpu_inc(nr_unused); - else + else if (rotate) inode->i_state |= I_REFERENCED; } @@ -443,16 +452,11 @@ static void inode_lru_list_add(struct inode *inode) */ void inode_add_lru(struct inode *inode) { - if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC | - I_FREEING | I_WILL_FREE)) && - !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE) - inode_lru_list_add(inode); + __inode_add_lru(inode, false); } - static void inode_lru_list_del(struct inode *inode) { - if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru)) this_cpu_dec(nr_unused); } @@ -728,10 +732,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) /* * Isolate the inode from the LRU in preparation for freeing it. * - * Any inodes which are pinned purely because of attached pagecache have their - * pagecache removed. If the inode has metadata buffers attached to - * mapping->private_list then try to remove them. - * * If the inode has the I_REFERENCED flag set, then it means that it has been * used recently - the flag is set in iput_final(). When we encounter such an * inode, clear the flag and move it to the back of the LRU so it gets another @@ -747,31 +747,39 @@ static enum lru_status inode_lru_isolate(struct list_head *item, struct inode *inode = container_of(item, struct inode, i_lru); /* - * we are inverting the lru lock/inode->i_lock here, so use a trylock. - * If we fail to get the lock, just skip it. + * We are inverting the lru lock/inode->i_lock here, so use a + * trylock. If we fail to get the lock, just skip it. */ if (!spin_trylock(&inode->i_lock)) return LRU_SKIP; /* - * Referenced or dirty inodes are still in use. Give them another pass - * through the LRU as we canot reclaim them now. + * Inodes can get referenced, redirtied, or repopulated while + * they're already on the LRU, and this can make them + * unreclaimable for a while. Remove them lazily here; iput, + * sync, or the last page cache deletion will requeue them. 
 	 */
 	if (atomic_read(&inode->i_count) ||
-	    (inode->i_state & ~I_REFERENCED)) {
+	    (inode->i_state & ~I_REFERENCED) ||
+	    !mapping_shrinkable(&inode->i_data)) {
 		list_lru_isolate(lru, &inode->i_lru);
 		spin_unlock(&inode->i_lock);
 		this_cpu_dec(nr_unused);
 		return LRU_REMOVED;
 	}
 
-	/* recently referenced inodes get one more pass */
+	/* Recently referenced inodes get one more pass */
 	if (inode->i_state & I_REFERENCED) {
 		inode->i_state &= ~I_REFERENCED;
 		spin_unlock(&inode->i_lock);
 		return LRU_ROTATE;
 	}
 
+	/*
+	 * On highmem systems, mapping_shrinkable() permits dropping
+	 * page cache in order to free up struct inodes: lowmem might
+	 * be under pressure before the cache inside the highmem zone.
+	 */
 	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
@@ -1638,7 +1646,7 @@ static void iput_final(struct inode *inode)
 	if (!drop &&
 	    !(inode->i_state & I_DONTCACHE) &&
 	    (sb->s_flags & SB_ACTIVE)) {
-		inode_add_lru(inode);
+		__inode_add_lru(inode, true);
 		spin_unlock(&inode->i_lock);
 		return;
 	}
diff --git a/fs/internal.h b/fs/internal.h
index 3cd065c8a..2854ff29f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -149,7 +149,6 @@ extern int vfs_open(const struct path *, struct file *);
  * inode.c
  */
 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
-extern void inode_add_lru(struct inode *inode);
 extern int dentry_needs_remove_privs(struct dentry *dentry);
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 56eba7234..af7bee128 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3194,6 +3194,7 @@ static inline void remove_inode_hash(struct inode *inode)
 }
 
 extern void inode_sb_list_add(struct inode *inode);
+extern void inode_add_lru(struct inode *inode);
 extern int sb_set_blocksize(struct super_block *, int);
 extern int sb_min_blocksize(struct super_block *, int);
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 62db6b017..5c74a45ff 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -23,6 +23,56 @@ static inline bool mapping_empty(struct address_space *mapping)
 	return xa_empty(&mapping->i_pages);
 }
 
+/*
+ * mapping_shrinkable - test if page cache state allows inode reclaim
+ * @mapping: the page cache mapping
+ *
+ * This checks the mapping's cache state for the purpose of inode
+ * reclaim and LRU management.
+ *
+ * The caller is expected to hold the i_lock, but is not required to
+ * hold the i_pages lock, which usually protects cache state. That's
+ * because the i_lock and the list_lru lock that protect the inode and
+ * its LRU state don't nest inside the irq-safe i_pages lock.
+ *
+ * Cache deletions are performed under the i_lock, which ensures that
+ * when an inode goes empty, it will reliably get queued on the LRU.
+ *
+ * Cache additions do not acquire the i_lock and may race with this
+ * check, in which case we'll report the inode as shrinkable when it
+ * has cache pages. This is okay: the shrinker also checks the
+ * refcount and the referenced bit, which will be elevated or set in
+ * the process of adding new cache pages to an inode.
+ */
+static inline bool mapping_shrinkable(struct address_space *mapping)
+{
+	void *head;
+
+	/*
+	 * On highmem systems, there could be lowmem pressure from the
+	 * inodes before there is highmem pressure from the page
+	 * cache. Make inodes shrinkable regardless of cache state.
+	 */
+	if (IS_ENABLED(CONFIG_HIGHMEM))
+		return true;
+
+	/* Cache completely empty? Shrink away.
*/ + head = rcu_access_pointer(mapping->i_pages.xa_head); + if (!head) + return true; + + /* + * The xarray stores single offset-0 entries directly in the + * head pointer, which allows non-resident page cache entries + * to escape the shadow shrinker's list of xarray nodes. The + * inode shrinker needs to pick them up under memory pressure. + */ + if (!xa_is_node(head) && xa_is_value(head)) + return true; + + return false; +} + /* * Bits in mapping->flags. */ diff --git a/mm/filemap.c b/mm/filemap.c index 82a17c35e..b432165a7 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -262,9 +262,13 @@ void delete_from_page_cache(struct page *page) struct address_space *mapping = page_mapping(page); BUG_ON(!PageLocked(page)); + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); __delete_from_page_cache(page, NULL); xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); page_cache_free_page(mapping, page); } @@ -340,6 +344,7 @@ void delete_from_page_cache_batch(struct address_space *mapping, if (!pagevec_count(pvec)) return; + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); for (i = 0; i < pagevec_count(pvec); i++) { trace_mm_filemap_delete_from_page_cache(pvec->pages[i]); @@ -348,6 +353,9 @@ void delete_from_page_cache_batch(struct address_space *mapping, } page_cache_delete_batch(mapping, pvec); xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); for (i = 0; i < pagevec_count(pvec); i++) page_cache_free_page(mapping, pvec->pages[i]); diff --git a/mm/truncate.c b/mm/truncate.c index 714eaf198..cc83a3f7c 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -45,9 +45,13 @@ static inline void __clear_shadow_entry(struct address_space *mapping, static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, void *entry) { + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); __clear_shadow_entry(mapping, index, entry); xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); } /* @@ -73,8 +77,10 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, return; dax = dax_mapping(mapping); - if (!dax) + if (!dax) { + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); + } for (i = j; i < pagevec_count(pvec); i++) { struct page *page = pvec->pages[i]; @@ -93,8 +99,12 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, __clear_shadow_entry(mapping, index, page); } - if (!dax) + if (!dax) { xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); + } pvec->nr = j; } @@ -567,6 +577,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL)) return 0; + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); if (PageDirty(page)) goto failed; @@ -574,6 +585,9 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) BUG_ON(page_has_private(page)); __delete_from_page_cache(page, NULL); xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); if (mapping->a_ops->freepage) mapping->a_ops->freepage(page); @@ -582,6 +596,7 @@ invalidate_complete_page2(struct 
address_space *mapping, struct page *page)
 	return 1;
 failed:
 	xa_unlock_irq(&mapping->i_pages);
+	spin_unlock(&mapping->host->i_lock);
 	return 0;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0d7a5a61b..da279fefd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1105,6 +1105,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
+	if (!PageSwapCache(page))
+		spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	/*
 	 * The non racy check for a busy page.
@@ -1173,6 +1175,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		shadow = workingset_eviction(page, target_memcg);
 		__delete_from_page_cache(page, shadow);
 		xa_unlock_irq(&mapping->i_pages);
+		if (mapping_shrinkable(mapping))
+			inode_add_lru(mapping->host);
+		spin_unlock(&mapping->host->i_lock);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -1182,6 +1187,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 cannot_free:
 	xa_unlock_irq(&mapping->i_pages);
+	if (!PageSwapCache(page))
+		spin_unlock(&mapping->host->i_lock);
 	return 0;
 }
 
diff --git a/mm/workingset.c b/mm/workingset.c
index d5b81e4f4..23df60ce2 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -543,6 +543,13 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 		goto out;
 	}
 
+	if (!spin_trylock(&mapping->host->i_lock)) {
+		xa_unlock(&mapping->i_pages);
+		spin_unlock_irq(lru_lock);
+		ret = LRU_RETRY;
+		goto out;
+	}
+
 	list_lru_isolate(lru, item);
 	__dec_lruvec_kmem_state(node, WORKINGSET_NODES);
 
@@ -562,6 +569,9 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 
 out_invalid:
 	xa_unlock_irq(&mapping->i_pages);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
 	ret = LRU_REMOVED_RETRY;
 out:
 	cond_resched();
-- 2.34.1.75.gabe6bb3905

From 7a2c5f1ad1e3aa7adfe09341b6f29a1e280140b5 Mon Sep 17 00:00:00 2001
From: Eric Dumazet
Date: Fri, 12 Nov 2021 08:19:50 -0800
Subject: [PATCH 15/18] x86/csum: rewrite csum_partial()

With more NICs supporting CHECKSUM_COMPLETE, and IPv6 being widely used, csum_partial() is heavily used with small amounts of bytes, and is consuming many cycles. IPv6 header size, for instance, is 40 bytes.

Another thing to consider is that NET_IP_ALIGN is 0 on x86, meaning that network headers are not word-aligned, unless the driver forces this.

This means that csum_partial() fetches one u16 to 'align the buffer', then performs three u64 additions with carry in a loop, then a remaining u32, then a remaining u16.

With this new version, we perform a loop only for the 64-byte blocks; the remainder is then bisected.

Tested on various cpus; all of them show a big reduction in csum_partial() cost (by 50 to 80%).

v3: - use "+r" (temp64) asm constraints (Andrew).
    - fold do_csum() into csum_partial(), as gcc does not inline it.
    - fix bug added in v2 for the "odd" case.
    - back to using adcq, as Andrew pointed out the clang bug that was adding a stall on my hosts. (separate patch to add32_with_carry() will follow)
    - use load_unaligned_zeropad() for final 1-7 bytes (Peter & Alexander).

v2: - removed the hard-coded switch(), as it was not RETPOLINE aware.
    - removed the final add32_with_carry() that we were doing in csum_partial(); we can simply pass @sum to do_csum().
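To illustrate the control flow without the x86 asm, here is a portable C sketch of the same bisection strategy. Userspace illustration with hypothetical names, little-endian assumed; the kernel version accumulates with adcq and uses load_unaligned_zeropad() for the tail.

	#include <stdint.h>
	#include <string.h>

	/* 64-bit ones' complement accumulation: fold the carry back in. */
	static uint64_t add64(uint64_t sum, uint64_t v)
	{
		sum += v;
		return sum + (sum < v);
	}

	static uint64_t csum_accumulate(const unsigned char *buf, size_t len,
					uint64_t sum)
	{
		uint64_t v;

		while (len >= 64) {		/* main loop: 64-byte blocks */
			for (int i = 0; i < 8; i++) {
				memcpy(&v, buf + 8 * i, 8);
				sum = add64(sum, v);
			}
			buf += 64;
			len -= 64;
		}
		/* bisect the remainder: 32-, 16-, then 8-byte chunks */
		for (size_t chunk = 32; chunk >= 8; chunk /= 2) {
			if (len & chunk) {
				for (size_t off = 0; off < chunk; off += 8) {
					memcpy(&v, buf + off, 8);
					sum = add64(sum, v);
				}
				buf += chunk;
			}
		}
		v = 0;				/* 1-7 trailing bytes, zero-padded */
		for (size_t i = 0; i < (len & 7); i++)
			v |= (uint64_t)buf[i] << (8 * i);
		return add64(sum, v);
	}

The result still needs the usual 64-to-32 fold (add32_with_carry() in the kernel) before it can be used as a checksum.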
Signed-off-by: Eric Dumazet Cc: Alexander Duyck Cc: Peter Zijlstra Cc: Andrew Cooper --- arch/x86/lib/csum-partial_64.c | 162 ++++++++++++++------------------- 1 file changed, 67 insertions(+), 95 deletions(-) diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c index e7925d668..5ec356269 100644 --- a/arch/x86/lib/csum-partial_64.c +++ b/arch/x86/lib/csum-partial_64.c @@ -9,6 +9,7 @@ #include #include #include +#include static inline unsigned short from32to16(unsigned a) { @@ -21,120 +22,92 @@ static inline unsigned short from32to16(unsigned a) } /* - * Do a 64-bit checksum on an arbitrary memory area. + * Do a checksum on an arbitrary memory area. * Returns a 32bit checksum. * * This isn't as time critical as it used to be because many NICs * do hardware checksumming these days. - * - * Things tried and found to not make it faster: - * Manual Prefetching - * Unrolling to an 128 bytes inner loop. - * Using interleaving with more registers to break the carry chains. + * + * Still, with CHECKSUM_COMPLETE this is called to compute + * checksums on IPv6 headers (40 bytes) and other small parts. + * it's best to have buff aligned on a 64-bit boundary */ -static unsigned do_csum(const unsigned char *buff, unsigned len) +__wsum csum_partial(const void *buff, int len, __wsum sum) { - unsigned odd, count; - unsigned long result = 0; + u64 temp64 = (__force u64)sum; + unsigned odd, result; - if (unlikely(len == 0)) - return result; odd = 1 & (unsigned long) buff; if (unlikely(odd)) { - result = *buff << 8; + if (unlikely(len == 0)) + return sum; + temp64 += (*(unsigned char *)buff << 8); len--; buff++; } - count = len >> 1; /* nr of 16-bit words.. */ - if (count) { - if (2 & (unsigned long) buff) { - result += *(unsigned short *)buff; - count--; - len -= 2; - buff += 2; - } - count >>= 1; /* nr of 32-bit words.. */ - if (count) { - unsigned long zero; - unsigned count64; - if (4 & (unsigned long) buff) { - result += *(unsigned int *) buff; - count--; - len -= 4; - buff += 4; - } - count >>= 1; /* nr of 64-bit words.. 
*/ - /* main loop using 64byte blocks */ - zero = 0; - count64 = count >> 3; - while (count64) { - asm("addq 0*8(%[src]),%[res]\n\t" - "adcq 1*8(%[src]),%[res]\n\t" - "adcq 2*8(%[src]),%[res]\n\t" - "adcq 3*8(%[src]),%[res]\n\t" - "adcq 4*8(%[src]),%[res]\n\t" - "adcq 5*8(%[src]),%[res]\n\t" - "adcq 6*8(%[src]),%[res]\n\t" - "adcq 7*8(%[src]),%[res]\n\t" - "adcq %[zero],%[res]" - : [res] "=r" (result) - : [src] "r" (buff), [zero] "r" (zero), - "[res]" (result)); - buff += 64; - count64--; - } + while (unlikely(len >= 64)) { + asm("addq 0*8(%[src]),%[res]\n\t" + "adcq 1*8(%[src]),%[res]\n\t" + "adcq 2*8(%[src]),%[res]\n\t" + "adcq 3*8(%[src]),%[res]\n\t" + "adcq 4*8(%[src]),%[res]\n\t" + "adcq 5*8(%[src]),%[res]\n\t" + "adcq 6*8(%[src]),%[res]\n\t" + "adcq 7*8(%[src]),%[res]\n\t" + "adcq $0,%[res]" + : [res] "+r" (temp64) + : [src] "r" (buff) + : "memory"); + buff += 64; + len -= 64; + } + + if (len & 32) { + asm("addq 0*8(%[src]),%[res]\n\t" + "adcq 1*8(%[src]),%[res]\n\t" + "adcq 2*8(%[src]),%[res]\n\t" + "adcq 3*8(%[src]),%[res]\n\t" + "adcq $0,%[res]" + : [res] "+r" (temp64) + : [src] "r" (buff) + : "memory"); + buff += 32; + } + if (len & 16) { + asm("addq 0*8(%[src]),%[res]\n\t" + "adcq 1*8(%[src]),%[res]\n\t" + "adcq $0,%[res]" + : [res] "+r" (temp64) + : [src] "r" (buff) + : "memory"); + buff += 16; + } + if (len & 8) { + asm("addq 0*8(%[src]),%[res]\n\t" + "adcq $0,%[res]" + : [res] "+r" (temp64) + : [src] "r" (buff) + : "memory"); + buff += 8; + } + if (len & 7) { + unsigned int shift = (8 - (len & 7)) * 8; + unsigned long trail; - /* last up to 7 8byte blocks */ - count %= 8; - while (count) { - asm("addq %1,%0\n\t" - "adcq %2,%0\n" - : "=r" (result) - : "m" (*(unsigned long *)buff), - "r" (zero), "0" (result)); - --count; - buff += 8; - } - result = add32_with_carry(result>>32, - result&0xffffffff); + trail = (load_unaligned_zeropad(buff) << shift) >> shift; - if (len & 4) { - result += *(unsigned int *) buff; - buff += 4; - } - } - if (len & 2) { - result += *(unsigned short *) buff; - buff += 2; - } + asm("addq %[trail],%[res]\n\t" + "adcq $0,%[res]" + : [res] "+r" (temp64) + : [trail] "r" (trail)); } - if (len & 1) - result += *buff; - result = add32_with_carry(result>>32, result & 0xffffffff); + result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff); if (unlikely(odd)) { result = from32to16(result); result = ((result >> 8) & 0xff) | ((result & 0xff) << 8); } - return result; -} - -/* - * computes the checksum of a memory block at buff, length len, - * and adds in "sum" (32-bit) - * - * returns a 32-bit number suitable for feeding into itself - * or csum_tcpudp_magic - * - * this function must be called with even lengths, except - * for the last fragment, which may be odd - * - * it's best to have buff aligned on a 64-bit boundary - */ -__wsum csum_partial(const void *buff, int len, __wsum sum) -{ - return (__force __wsum)add32_with_carry(do_csum(buff, len), - (__force u32)sum); + return (__force __wsum)result; } EXPORT_SYMBOL(csum_partial); @@ -147,4 +120,3 @@ __sum16 ip_compute_csum(const void *buff, int len) return csum_fold(csum_partial(buff,len,0)); } EXPORT_SYMBOL(ip_compute_csum); - -- 2.34.1.75.gabe6bb3905 From 8a4472808bf9469312cd55ca404a6bb24b452945 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 18 Nov 2021 09:52:39 -0800 Subject: [PATCH 16/18] x86/csum: Fix compilation error for UM load_unaligned_zeropad() is not yet universal. ARCH=um SUBARCH=x86_64 builds do not have it. 
When CONFIG_DCACHE_WORD_ACCESS is not set, simply continue the bisection with 4, 2 and 1 byte steps. Fixes: df4554cebdaa ("x86/csum: Rewrite/optimize csum_partial()") Reported-by: kernel test robot Signed-off-by: Eric Dumazet Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20211118175239.1525650-1-eric.dumazet@gmail.com --- arch/x86/lib/csum-partial_64.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c index 5ec356269..1eb8f2d11 100644 --- a/arch/x86/lib/csum-partial_64.c +++ b/arch/x86/lib/csum-partial_64.c @@ -92,6 +92,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum) buff += 8; } if (len & 7) { +#ifdef CONFIG_DCACHE_WORD_ACCESS unsigned int shift = (8 - (len & 7)) * 8; unsigned long trail; @@ -101,6 +102,31 @@ __wsum csum_partial(const void *buff, int len, __wsum sum) "adcq $0,%[res]" : [res] "+r" (temp64) : [trail] "r" (trail)); +#else + if (len & 4) { + asm("addq %[val],%[res]\n\t" + "adcq $0,%[res]" + : [res] "+r" (temp64) + : [val] "r" ((u64)*(u32 *)buff) + : "memory"); + buff += 4; + } + if (len & 2) { + asm("addq %[val],%[res]\n\t" + "adcq $0,%[res]" + : [res] "+r" (temp64) + : [val] "r" ((u64)*(u16 *)buff) + : "memory"); + buff += 2; + } + if (len & 1) { + asm("addq %[val],%[res]\n\t" + "adcq $0,%[res]" + : [res] "+r" (temp64) + : [val] "r" ((u64)*(u8 *)buff) + : "memory"); + } +#endif } result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff); if (unlikely(odd)) { -- 2.34.1.75.gabe6bb3905 From 7126c1e0cf6882bfbb1d0b3d890556061778e5d2 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 25 Nov 2021 06:18:17 -0800 Subject: [PATCH 17/18] x86/csum: Fix initial seed for odd buffers When I folded do_csum() into csum_partial(), I missed that we had to swap odd/even bytes from @sum argument. This is because this swap will happen again at the end of the function. [A, B, C, D] -> [B, A, D, C] As far as Internet checksums (rfc 1071) are concerned, we can instead rotate the whole 32bit value by 8 (or 24) -> [D, A, B, C] Note that I played with the idea of replacing this final swapping: result = from32to16(result); result = ((result >> 8) & 0xff) | ((result & 0xff) << 8); With: result = ror32(result, 8); But while the generated code was definitely better for the odd case, run time cost for the more likely even case was not better for gcc. gcc is replacing a well predicted conditional branch with a cmov instruction after a ror instruction which adds a cost canceling the cmov gain. Many thanks to Noah Goldstein for reporting this issue. 
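For reference, here are the two seed transforms discussed above, written out in plain C. ror32() mirrors the kernel helper of the same name (valid here for shifts in 1..31); swap16_pairs() is a hypothetical name for the per-16-bit byte swap described in the message.

	#include <stdint.h>

	/* [A,B,C,D] -> [D,A,B,C]: what the patch applies to the seed. */
	static uint32_t ror32(uint32_t w, unsigned int n)
	{
		return (w >> n) | (w << (32 - n));
	}

	/* [A,B,C,D] -> [B,A,D,C]: the per-16-bit byte swap the commit describes. */
	static uint32_t swap16_pairs(uint32_t w)
	{
		return ((w & 0x00ff00ffu) << 8) | ((w >> 8) & 0x00ff00ffu);
	}

Per the reasoning above, seeding with ror32(sum, 8) instead of swap16_pairs(sum) yields the same folded RFC 1071 checksum once the final byte swap is applied, which is what makes the cheaper rotation valid.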
[ dhansen: * spelling: swaping => swapping
           * updated Fixes commit ]

Cc: Peter Zijlstra (Intel)
Fixes: d31c3c683ee6 ("x86/csum: Rewrite/optimize csum_partial()")
Reported-by: Noah Goldstein
Signed-off-by: Eric Dumazet
Signed-off-by: Dave Hansen
Link: https://lkml.kernel.org/r/20211125141817.3541501-1-eric.dumazet@gmail.com
---
 arch/x86/lib/csum-partial_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 1eb8f2d11..40b527ba1 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -41,6 +41,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
 	if (unlikely(odd)) {
 		if (unlikely(len == 0))
 			return sum;
+		temp64 = ror32((__force u32)sum, 8);
 		temp64 += (*(unsigned char *)buff << 8);
 		len--;
 		buff++;
-- 2.34.1.75.gabe6bb3905

From acf1dd1488b74a203e1f032ca754a4d039219215 Mon Sep 17 00:00:00 2001
From: Dave Chinner
Date: Thu, 16 Dec 2021 11:17:09 +1100
Subject: [PATCH 18/18] xfs: check sb_meta_uuid for dabuf buffer recovery

Got a report that a repeated crash test of a container host would eventually fail with a log recovery error preventing the system from mounting the root filesystem. It manifested as a directory leaf node corruption on writeback like so:

XFS (loop0): Mounting V5 Filesystem
XFS (loop0): Starting recovery (logdev: internal)
XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b  ........=.......
00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc  .......X...)....
00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23  ..x..~J}.S...G.#
00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00  .........C......
00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a  ................
00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50  .5y....0.......P
00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4  .@.......A......
00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c  .b.......P!A....
XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514). Shutting down.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)
XFS (loop0): log mount/recovery failed: error -117
XFS (loop0): log mount failed

Tracing indicated that we were recovering changes from a transaction at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57. That is, log recovery was overwriting a buffer with newer changes on disk than was in the transaction. Tracing also indicated that we were hitting the "recovery immediately" case in xlog_recover_get_buf_lsn(), and hence it was ignoring the LSN in the buffer.

The code was extracting the LSN correctly, then ignoring it because the UUID in the buffer did not match the superblock UUID. The problem arises because the UUID check uses the wrong UUID - it should be checking sb_meta_uuid, not sb_uuid. This filesystem has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the correct matching sb_meta_uuid in it; it's just that the code checked it against the wrong superblock uuid.

There is no corruption in the filesystem, and failing to recover the buffer due to a write verifier failure means the recovery bug did not propagate the corruption to disk.
Hence there is no corruption before or after this bug has manifested; the impact is limited simply to an unmountable filesystem....

This was missed back in 2015 during an audit of incorrect sb_uuid usage that resulted in commit fcfbe2c4ef42 ("xfs: log recovery needs to validate against sb_meta_uuid"), which fixed the magic32 buffers to validate against sb_meta_uuid instead of sb_uuid. It missed the magicda buffers....

Fixes: ce748eaa65f2 ("xfs: create new metadata UUID field and incompat flag")
Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_buf_item_recover.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
index a476c7ef5..991fbf1eb 100644
--- a/fs/xfs/xfs_buf_item_recover.c
+++ b/fs/xfs/xfs_buf_item_recover.c
@@ -816,7 +816,7 @@ xlog_recover_get_buf_lsn(
 	}
 
 	if (lsn != (xfs_lsn_t)-1) {
-		if (!uuid_equal(&mp->m_sb.sb_uuid, uuid))
+		if (!uuid_equal(&mp->m_sb.sb_meta_uuid, uuid))
 			goto recover_immediately;
 		return lsn;
 	}
-- 2.34.1.75.gabe6bb3905