[00:01:08] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1054071 (owner: 10TrainBranchBot) [00:14:50] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:15:44] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29699 bytes in 3.261 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:45:12] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 216036000 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:12] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7136 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:58:31] (03PS3) 10Dbrant: Enable account vanishing in CentralAuth (labs). 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141) [01:20:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 332.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:26:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T367856)', diff saved to https://phabricator.wikimedia.org/P66467 and previous config saved to /var/cache/conftool/dbconfig/20240715-012559-marostegui.json [01:26:04] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [01:34:20] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 391.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:41:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P66469 and previous config saved to /var/cache/conftool/dbconfig/20240715-014106-marostegui.json [01:53:50] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [01:55:20] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 47.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:56:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P66470 and previous config saved to /var/cache/conftool/dbconfig/20240715-015613-marostegui.json [01:56:44] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29698 bytes in 2.501 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [02:01:26] PROBLEM - 
MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:11:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T367856)', diff saved to https://phabricator.wikimedia.org/P66471 and previous config saved to /var/cache/conftool/dbconfig/20240715-021121-marostegui.json [02:11:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [02:11:25] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [02:11:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [02:15:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 51.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:39:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:46:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [02:51:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [02:59:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:18:48] FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:23:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [03:28:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [03:29:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 430.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:40:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [03:41:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 57.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:45:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - 
https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [04:09:45] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [04:12:21] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove IPV6 for dbproxy200[5-8] - pt1979@cumin2002" [04:13:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove IPV6 for dbproxy200[5-8] - pt1979@cumin2002" [04:13:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [04:18:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 394.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:34:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:39:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:47:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [04:47:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [04:47:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T367856)', diff saved to 
https://phabricator.wikimedia.org/P66472 and previous config saved to /var/cache/conftool/dbconfig/20240715-044723-marostegui.json [04:47:28] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [05:05:16] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1223 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1054076 (https://phabricator.wikimedia.org/T370019) [05:05:20] (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1054077 (https://phabricator.wikimedia.org/T370019) [05:12:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host dbproxy2005.codfw.wmnet with OS bookworm [05:12:47] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9980121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host dbproxy2005.codfw.wmnet with OS bookworm [05:23:14] (03PS1) 10Marostegui: an-redacteddb1001.yaml: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1054078 [05:23:49] (03CR) 10Marostegui: [C:03+2] an-redacteddb1001.yaml: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1054078 (owner: 10Marostegui) [05:25:09] (03CR) 10NMW03: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: 10Seawolf35gerrit) [05:25:25] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9980122 (10Marostegui) @Papaul the interface wasn't in netbox anymore, but the DNS entry for that host is still gone. I've tried to reimage the host but it gets stuck on the... 
[05:27:00] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9980123 (10Marostegui) Just talked to @papaul - the reimage was expected to fail since the iface was moved back to the 10G one. [05:39:08] (03PS1) 10Kevin Bazira: ml-services: readability_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054080 (https://phabricator.wikimedia.org/T369344) [05:43:04] (03PS1) 10Giuseppe Lavagetto: varnish: add requestctl filters for cache hits [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) [05:43:06] (03PS1) 10Giuseppe Lavagetto: varnish: add support for hit rules [puppet] - 10https://gerrit.wikimedia.org/r/1054082 (https://phabricator.wikimedia.org/T369480) [05:43:08] (03PS1) 10Giuseppe Lavagetto: varnish: actually include the requestctl hit rules [puppet] - 10https://gerrit.wikimedia.org/r/1054083 (https://phabricator.wikimedia.org/T369480) [05:46:11] (03CR) 10CI reject: [V:04-1] varnish: add requestctl filters for cache hits [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) (owner: 10Giuseppe Lavagetto) [05:46:16] (03CR) 10CI reject: [V:04-1] varnish: add support for hit rules [puppet] - 10https://gerrit.wikimedia.org/r/1054082 (https://phabricator.wikimedia.org/T369480) (owner: 10Giuseppe Lavagetto) [05:53:31] (03PS1) 10NMW03: Add Portal namespace for Ingush Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) [05:54:08] (03CR) 10CI reject: [V:04-1] Add Portal namespace for Ingush Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) (owner: 10NMW03) [06:01:41] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054086 (https://phabricator.wikimedia.org/T349774) [06:03:51] (03CR) 10DDesouza: 
[C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054086 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:05:11] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054086 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [06:06:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 9.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:06:27] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [06:06:48] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [06:06:49] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [06:07:23] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [06:07:24] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [06:07:51] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:22:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool db2136', diff saved to https://phabricator.wikimedia.org/P66473 and previous config saved to 
/var/cache/conftool/dbconfig/20240715-062216-root.json [06:22:26] (03CR) 10Arnaudb: [C:03+1] db1179: Disable notification for db1179 [puppet] - 10https://gerrit.wikimedia.org/r/1054055 (https://phabricator.wikimedia.org/T369855) (owner: 10Ladsgroup) [06:23:11] (03CR) 10Marostegui: "This will only work once the host is back up (so puppet runs), meanwhile I'd suggest to extend the downtime" [puppet] - 10https://gerrit.wikimedia.org/r/1054055 (https://phabricator.wikimedia.org/T369855) (owner: 10Ladsgroup) [06:25:58] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:26:50] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29689 bytes in 1.384 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:30:31] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3222/console" [puppet] - 10https://gerrit.wikimedia.org/r/1049761 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [06:31:47] (03CR) 10Slyngshede: [V:03+1 C:03+2] R:idp New CAS 7 hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1049761 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [06:48:28] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9980169 (10Marostegui) [06:52:03] !log test [06:59:18] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:05] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:57] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp1004.wikimedia.org [07:00:59] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox [07:03:16] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1004.wikimedia.org - slyngshede@cumin1002" [07:04:21] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1004.wikimedia.org - slyngshede@cumin1002" [07:04:21] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:04:22] !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp1004.wikimedia.org on all recursors [07:04:25] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp1004.wikimedia.org on all recursors [07:04:51] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
idp1004.wikimedia.org - slyngshede@cumin1002" [07:05:50] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp1004.wikimedia.org - slyngshede@cumin1002" [07:06:20] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp1004.wikimedia.org with OS bookworm [07:06:32] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980172 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp1004.wikimedia.org with OS bookworm [07:08:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:13:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:16:18] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980175 (10SLyngshede-WMF) a:03SLyngshede-WMF [07:17:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855 [07:17:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855 [07:17:54] T369855: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855 [07:17:56] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp1004.wikimedia.org with reason: host reimage [07:18:06] (03CR) 10Arnaudb: [C:03+1] "{{done}}" [puppet] - 10https://gerrit.wikimedia.org/r/1054055 
(https://phabricator.wikimedia.org/T369855) (owner: 10Ladsgroup) [07:18:48] FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:21:10] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp1004.wikimedia.org with reason: host reimage [07:22:56] (03CR) 10Marostegui: [C:03+2] db1179: Disable notification for db1179 [puppet] - 10https://gerrit.wikimedia.org/r/1054055 (https://phabricator.wikimedia.org/T369855) (owner: 10Ladsgroup) [07:24:00] (03CR) 10Elukey: [C:03+2] cfssl: add a condition to cfssl_ocsprefresh.py [puppet] - 10https://gerrit.wikimedia.org/r/1053913 (https://phabricator.wikimedia.org/T363829) (owner: 10Elukey) [07:24:16] (03PS2) 10Jelto: gitlab: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882) [07:28:01] (03CR) 10Elukey: [V:03+1 C:03+2] profile::puppetserver::gitprivate: fix post-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/1053623 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [07:28:09] (03CR) 10Elukey: [C:03+2] profile::tcpircbot: allow inbound conn from puppetserver nodes [puppet] - 10https://gerrit.wikimedia.org/r/1053616 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [07:28:18] (03CR) 10Elukey: [C:03+2] profile::kerberos::kadminserver: allow more nodes in rsync [puppet] - 10https://gerrit.wikimedia.org/r/1053619 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [07:29:27] (03PS2) 10Ladsgroup: mariadb: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/761927 (https://phabricator.wikimedia.org/T297605) [07:30:57] (03Abandoned) 10Marostegui: mariadb: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/761927 (https://phabricator.wikimedia.org/T297605) (owner: 
10Ladsgroup) [07:34:19] (03PS1) 10Marostegui: packages_wmf.pp: Remove Buster support [puppet] - 10https://gerrit.wikimedia.org/r/1054271 [07:35:26] (03CR) 10Marostegui: [C:04-2] "Pending checking if there are Busters in WMCS land" [puppet] - 10https://gerrit.wikimedia.org/r/1054271 (owner: 10Marostegui) [07:36:51] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp1004.wikimedia.org with OS bookworm [07:36:51] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp1004.wikimedia.org [07:36:59] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980227 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp1004.wikimedia.org with OS bookworm completed: - idp1004 (**PASS*... [07:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:44:45] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3223/co" [puppet] - 10https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [07:46:16] (03CR) 10Jelto: [V:03+1] gitlab: replace ferm::service with firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [07:51:45] (03CR) 10Fabfur: varnish: add requestctl filters for cache hits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) (owner: 10Giuseppe Lavagetto) [07:53:37] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp2004.wikimedia.org [07:53:38] !log 
slyngshede@cumin1002 START - Cookbook sre.dns.netbox [07:55:57] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2004.wikimedia.org - slyngshede@cumin1002" [07:57:14] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2004.wikimedia.org - slyngshede@cumin1002" [07:57:14] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:57:15] !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp2004.wikimedia.org on all recursors [07:57:18] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp2004.wikimedia.org on all recursors [07:57:52] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2004.wikimedia.org - slyngshede@cumin1002" [07:58:51] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2004.wikimedia.org - slyngshede@cumin1002" [08:00:14] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980252 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp2004.wikimedia.org with OS bookworm [08:01:33] (03PS3) 10Jelto: gitlab: switch gitlab from iptables to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) [08:01:39] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: 10Seawolf35gerrit) [08:04:34] !log volans@cumin2002 START - Cookbook sre.hosts.downtime 
for 1:00:00 on 28 hosts with reason: Primary switchover s7 T369882 [08:04:37] T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882 [08:05:17] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T369882 [08:09:49] !log volans@cumin2002 dbctl commit (dc=all): 'Set db2218 with weight 0 T369882', diff saved to https://phabricator.wikimedia.org/P66474 and previous config saved to /var/cache/conftool/dbconfig/20240715-080948-volans.json [08:09:53] T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882 [08:12:54] !log volans@cumin2002 dbctl commit (dc=all): 'Remove db2218 from API T369882', diff saved to https://phabricator.wikimedia.org/P66475 and previous config saved to /var/cache/conftool/dbconfig/20240715-081252-volans.json [08:13:22] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp2004.wikimedia.org with reason: host reimage [08:16:28] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp2004.wikimedia.org with reason: host reimage [08:18:19] (03CR) 10Giuseppe Lavagetto: varnish: add requestctl filters for cache hits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) (owner: 10Giuseppe Lavagetto) [08:18:29] (03CR) 10Jelto: "unfortunately I was not able to do that. firewall::service expects a array of Array[Stdlib::IP::Address] and not a String. 
But if we deplo" [puppet] - 10https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [08:19:10] (03PS2) 10Giuseppe Lavagetto: varnish: add requestctl filters for cache hits [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) [08:19:10] (03PS2) 10Giuseppe Lavagetto: varnish: add support for hit rules [puppet] - 10https://gerrit.wikimedia.org/r/1054082 (https://phabricator.wikimedia.org/T369480) [08:19:10] (03PS2) 10Giuseppe Lavagetto: varnish: actually include the requestctl hit rules [puppet] - 10https://gerrit.wikimedia.org/r/1054083 (https://phabricator.wikimedia.org/T369480) [08:20:19] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3224/co" [puppet] - 10https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [08:21:59] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 52468 [08:22:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 52468 [08:25:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [08:28:56] (03PS1) 10Elukey: profile::tcpircbot: allow puppetservers to contact tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/1054273 (https://phabricator.wikimedia.org/T368023) [08:30:15] (03PS3) 10Jelto: gitlab: introduce log rotation settings [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) [08:30:33] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3225/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1054273 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [08:30:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [08:31:02] (03CR) 10FNegri: [C:03+1] "I did a quick search in operations/puppet and cloud/instance-puppet for the class profile::mariadb::packages_wmf" [puppet] - 10https://gerrit.wikimedia.org/r/1054271 (owner: 10Marostegui) [08:31:29] (03CR) 10Marostegui: packages_wmf.pp: Remove Buster support [puppet] - 10https://gerrit.wikimedia.org/r/1054271 (owner: 10Marostegui) [08:31:38] (03CR) 10Marostegui: [C:03+2] packages_wmf.pp: Remove Buster support [puppet] - 10https://gerrit.wikimedia.org/r/1054271 (owner: 10Marostegui) [08:32:29] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:32:35] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3226/co" [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) (owner: 10Jelto) [08:32:55] (03CR) 10Jelto: [V:03+1] gitlab: introduce log rotation settings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) (owner: 10Jelto) [08:33:21] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp2004.wikimedia.org with OS bookworm [08:33:21] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp2004.wikimedia.org [08:33:30] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980308 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp2004.wikimedia.org with OS bookworm completed: - idp2004 (**PASS*... [08:34:51] (03PS1) 10Btullis: Add an-redacteddb1001 to the mysql eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/1054275 (https://phabricator.wikimedia.org/T365453) [08:35:00] (03PS3) 10Giuseppe Lavagetto: varnish: add requestctl filters for cache hits [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) [08:35:00] (03PS3) 10Giuseppe Lavagetto: varnish: add support for hit rules [puppet] - 10https://gerrit.wikimedia.org/r/1054082 (https://phabricator.wikimedia.org/T369480) [08:35:00] (03PS3) 10Giuseppe Lavagetto: varnish: actually include the requestctl hit rules [puppet] - 10https://gerrit.wikimedia.org/r/1054083 (https://phabricator.wikimedia.org/T369480) [08:35:25] (03PS1) 10Slyngshede: P:idp Add idp2004 to CAS 7 cluster. [puppet] - 10https://gerrit.wikimedia.org/r/1054277 (https://phabricator.wikimedia.org/T367487) [08:35:56] (03PS2) 10Btullis: Add an-redacteddb1001 to the mysql eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/1054275 (https://phabricator.wikimedia.org/T365453) [08:36:48] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3227/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054275 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [08:39:16] (03CR) 10Marostegui: [C:03+1] Add an-redacteddb1001 to the mysql eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/1054275 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [08:39:33] (03Abandoned) 10Btullis: Revert the change to disable the gobbin timers on an-launcher [puppet] - 10https://gerrit.wikimedia.org/r/1052945 (https://phabricator.wikimedia.org/T365503) (owner: 10Btullis) [08:39:54] (03CR) 10Btullis: [V:03+1 C:03+2] Add 
an-redacteddb1001 to the mysql eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/1054275 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [08:41:07] (03PS2) 10Slyngshede: Styling: Allow the use of normal Codex tables. [software/bitu] - 10https://gerrit.wikimedia.org/r/1052923 [08:41:07] (03PS3) 10Slyngshede: Permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1052924 [08:42:30] (03CR) 10Urbanecm: [C:03+1] "no objection; i'm wondering whether we should have separate hiera keys for lists that are synced as dry-run and that are synced for real. " [puppet] - 10https://gerrit.wikimedia.org/r/1053399 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [08:42:58] (03CR) 10CI reject: [V:04-1] Permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1052924 (owner: 10Slyngshede) [08:44:00] (03CR) 10Urbanecm: [C:03+1] mailman3: defined type to sync list members, create timers for each list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053399 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [08:44:22] (03CR) 10Elukey: [C:03+1] "The status in netbox seems to be "unknown", at least from what puppet reports in its motd. Expected?" [puppet] - 10https://gerrit.wikimedia.org/r/1054277 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [08:45:40] (03CR) 10Slyngshede: "I've only JUST created it, so some lag maybe? Anyway, DNS is there, and that's the bit that's required." [puppet] - 10https://gerrit.wikimedia.org/r/1054277 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [08:45:45] (03CR) 10Slyngshede: [C:03+2] P:idp Add idp2004 to CAS 7 cluster. 
[puppet] - 10https://gerrit.wikimedia.org/r/1054277 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [08:46:14] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#9980350 (10Clement_Goubert) 05Open→03In progress a:03Clement_Goubert [08:46:35] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1053827 (https://phabricator.wikimedia.org/T369882) (owner: 10Gerrit maintenance bot) [08:47:09] (03CR) 10Arnaudb: [C:03+1] mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1053827 (https://phabricator.wikimedia.org/T369882) (owner: 10Gerrit maintenance bot) [08:47:14] (03PS1) 10Clément Goubert: data.yaml: Add krb access for xiaoxiao [puppet] - 10https://gerrit.wikimedia.org/r/1054278 (https://phabricator.wikimedia.org/T369517) [08:48:14] (03CR) 10Volans: [C:03+2] mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1053827 (https://phabricator.wikimedia.org/T369882) (owner: 10Gerrit maintenance bot) [08:48:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) (owner: 10Lucas Werkmeister (WMDE)) [08:49:34] 06SRE, 06collaboration-services: gitlab2002: wrong network for pulic IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#9980370 (10Clement_Goubert) [08:51:16] !log Starting s7 codfw failover from db2121 to db2218 - T369882 [08:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:19] T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882 [08:54:52] (03CR) 10Slyngshede: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1054278 (https://phabricator.wikimedia.org/T369517) (owner: 10Clément 
Goubert) [08:54:57] (03CR) 10Slyngshede: [C:03+1] data.yaml: Add krb access for xiaoxiao [puppet] - 10https://gerrit.wikimedia.org/r/1054278 (https://phabricator.wikimedia.org/T369517) (owner: 10Clément Goubert) [08:55:17] (03CR) 10Clément Goubert: [C:03+2] data.yaml: Add krb access for xiaoxiao [puppet] - 10https://gerrit.wikimedia.org/r/1054278 (https://phabricator.wikimedia.org/T369517) (owner: 10Clément Goubert) [08:55:28] (03CR) 10Ayounsi: [C:03+1] "yep indeed !" [homer/public] - 10https://gerrit.wikimedia.org/r/1053935 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) [08:56:54] !log volans@cumin1002 dbctl commit (dc=all): 'Promote db2218 to s7 primary T369882', diff saved to https://phabricator.wikimedia.org/P66477 and previous config saved to /var/cache/conftool/dbconfig/20240715-085654-volans.json [08:56:58] T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882 [09:02:54] (03CR) 10Filippo Giunchedi: [C:03+1] profile::tcpircbot: allow puppetservers to contact tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/1054273 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [09:03:06] (03CR) 10Elukey: [V:03+1 C:03+2] profile::tcpircbot: allow puppetservers to contact tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/1054273 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [09:05:07] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#9980431 (10Clement_Goubert) 05In progress→03Resolved p:05Triage→03Medium @XiaoXiao-WMF You should have received an email with instructions on... 
[09:05:33] !log volans@cumin1002 dbctl commit (dc=all): 'Depool db2121 T369882', diff saved to https://phabricator.wikimedia.org/P66478 and previous config saved to /var/cache/conftool/dbconfig/20240715-090532-volans.json [09:05:37] T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882 [09:06:56] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: introduce log rotation settings [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) (owner: 10Jelto) [09:08:29] (03CR) 10Filippo Giunchedi: mysql: replication lag monitoring threshold and severity change (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [09:09:57] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9980442 (10Clement_Goubert) 05Open→03Resolved Resolving this as it seems everything is in order. Don't hesitate to reopen should you encounter any issues. 
[09:14:40] (03PS1) 10Marostegui: db2121: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054281 [09:14:44] !log elukey@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d3-codfw [09:14:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Long schema change [09:15:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Long schema change [09:15:17] (03CR) 10Marostegui: [C:03+2] db2121: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054281 (owner: 10Marostegui) [09:15:56] !log Deploy schema change on s7 codfw db2121 dbmaint T367856 [09:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:00] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [09:16:54] !log elukey@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-d3-codfw [09:17:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:17:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:18:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T367856)', diff saved to https://phabricator.wikimedia.org/P66479 and previous config saved to /var/cache/conftool/dbconfig/20240715-091800-marostegui.json [09:18:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Long schema change [09:18:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Long schema change [09:19:03] !log Deploy schema change on s7 eqiad db1170 dbmaint T367856 [09:19:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:44] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:23:26] (03PS1) 10Clément Goubert: Revert^2 "changeprop-jobqueue: increase prioritised video concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054283 [09:25:20] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:29:05] !log manually removing mw1349.eqiad.wmnet mw1350.eqiad.wmnet mw1351.eqiad.wmnet from k8s following reimage to videoscalers - T351074 [09:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:09] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [09:33:33] RESOLVED: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:38:36] (03PS4) 10Ayounsi: Netbox 4: create parent directories [puppet] - 10https://gerrit.wikimedia.org/r/1053636 (https://phabricator.wikimedia.org/T336275) [09:38:37] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 13Patch-For-Review, 10Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794#9980527 (10Joe) To clarify a bit, I didn't take the route described in the task. In fact, we want: *... 
[09:41:16] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 13Patch-For-Review, 10Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794#9980543 (10Joe) a:03Joe [09:41:25] (03CR) 10Ayounsi: Netbox 4: create parent directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053636 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:41:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [09:41:47] (03CR) 10Ayounsi: [C:03+2] Netbox 4: create parent directories [puppet] - 10https://gerrit.wikimedia.org/r/1053636 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:42:01] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [09:46:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [09:50:51] (03CR) 10Ayounsi: [C:03+2] Netbox 4: create customscript parent directory as well [puppet] - 10https://gerrit.wikimedia.org/r/1048402 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:53:47] (03PS1) 10Filippo Giunchedi: o11y: alert on benthos-webrequest-sampled lag [alerts] - 10https://gerrit.wikimedia.org/r/1054288 [09:54:33] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 49544 [09:55:23] (03CR) 10CI reject: [V:04-1] o11y: alert on benthos-webrequest-sampled lag [alerts] - 10https://gerrit.wikimedia.org/r/1054288 (owner: 10Filippo Giunchedi) [09:56:49] !log ayounsi@cumin1002 END (PASS) - 
Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 49544 [09:57:35] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61941 [09:58:14] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61941 [09:58:22] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 262293 [09:58:34] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262293 [09:58:43] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 270361 [09:58:57] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270361 [09:59:19] (03PS1) 10Elukey: profile::pki::multirootca: use info in the client auth vhost [puppet] - 10https://gerrit.wikimedia.org/r/1054289 (https://phabricator.wikimedia.org/T355750) [09:59:26] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 52999 [09:59:38] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52999 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1000) [10:00:14] (03PS2) 10Elukey: profile::pki::multirootca: use info in the client auth vhost [puppet] - 10https://gerrit.wikimedia.org/r/1054289 (https://phabricator.wikimedia.org/T355750) [10:01:16] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3228/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054289 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [10:01:40] (03CR) 10Ayounsi: [C:03+1] profile::pki::multirootca: use info in the client auth vhost [puppet] - 10https://gerrit.wikimedia.org/r/1054289 
(https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [10:04:49] (03CR) 10Elukey: [V:03+1 C:03+2] profile::pki::multirootca: use info in the client auth vhost [puppet] - 10https://gerrit.wikimedia.org/r/1054289 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [10:14:53] (03PS2) 10Filippo Giunchedi: o11y: alert on benthos-webrequest-sampled lag [alerts] - 10https://gerrit.wikimedia.org/r/1054288 [10:20:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [10:20:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [10:21:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [10:21:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T367856)', diff saved to https://phabricator.wikimedia.org/P66480 and previous config saved to /var/cache/conftool/dbconfig/20240715-102117-marostegui.json [10:21:21] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:23:36] (03PS1) 10Btullis: Correct the signing key for the yarn apt repo [puppet] - 10https://gerrit.wikimedia.org/r/1054296 (https://phabricator.wikimedia.org/T365839) [10:24:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:24:19] (03CR) 10Btullis: [C:03+2] Correct the signing key for the yarn apt repo [puppet] - 10https://gerrit.wikimedia.org/r/1054296 
(https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [10:25:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [10:29:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:31:18] arnaudb: bunch of 1205 lock wait timeout exceeded errors on mw-jobrunners during two minutes, looks like only commons [10:32:04] It's stopped now, but it was a good 1k errors [10:33:56] thanks for the heads up claime, cc marostegui [10:37:49] claime: mwjobrunners are hitting clouddb, no? [10:42:21] arnaudb: err I don't think they should [10:42:37] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdj) failed in moss-be2002 - https://phabricator.wikimedia.org/T370034 (10MatthewVernon) 03NEW [10:42:42] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdj) failed in moss-be2002 - https://phabricator.wikimedia.org/T370034#9980767 (10MatthewVernon) p:05Triage→03Medium [10:50:04] (03CR) 10Clément Goubert: "> > Do you expect wgMetricsPlatformInstrumentConfiguratorBaseUrl to be different per-wiki?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [10:51:14] (03PS1) 10Marostegui: db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054303 [10:51:49] (03CR) 10Marostegui: [C:03+2] db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054303 (owner: 10Marostegui) [10:59:18] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:02:35] (03CR) 10Kamila Součková: [C:03+1] Revert^2 "changeprop-jobqueue: increase prioritised video concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054283 (owner: 10Clément Goubert) [11:02:46] (03CR) 10Clément Goubert: [C:03+2] Revert^2 "changeprop-jobqueue: increase prioritised video concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054283 (owner: 10Clément Goubert) [11:03:51] (03Merged) 10jenkins-bot: Revert^2 "changeprop-jobqueue: increase prioritised video concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054283 (owner: 10Clément Goubert) [11:08:13] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:08:33] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:09:25] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:30] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:11:02] !log Increasing webVideoTranscodePrioritized concurrency in changeprop-jobqueue [11:11:04] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:07] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:11:47] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:12:49] (03CR) 10Urbanecm: "reviewing per a request from Seddon :). logged a few questions inline!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141) (owner: 10Dbrant) [11:14:25] RESOLVED: SystemdUnitFailed: git_pull_charts.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:03] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872#9980861 (10MatthewVernon) So, looking at [[ https://netbox.wikimedia.org/dcim/devices/?q=ms-be2&sort=rack | netbox ]], hosts are distributed in cod... 
[11:24:41] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:25:07] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:29:18] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:30:21] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [11:30:36] FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:31:35] !log Reboot stashbot [11:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:17] !log test [11:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:15] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29689 bytes in 2.431 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [11:36:49] (03CR) 10Seddon: Enable account vanishing in CentralAuth (labs). 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141) (owner: 10Dbrant) [11:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:37:53] jouncebot: nowandnext [11:37:53] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [11:37:53] In 1 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1300) [11:38:37] (03PS4) 10Jelto: gitlab: switch gitlab from iptables to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) [11:39:06] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9980908 (10Volans) [11:40:04] (03CR) 10Urbanecm: [C:03+2] "Makes sense. Seddon mentioned this is becoming essential, and none of the questions logged is a critical one, so let's ship this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141) (owner: 10Dbrant) [11:40:27] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9980911 (10Volans) Confirmed it's all good for this specific task, marked as such in the task description. [11:40:41] (03Merged) 10jenkins-bot: Enable account vanishing in CentralAuth (labs). 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141) (owner: 10Dbrant) [11:42:02] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3229/co" [puppet] - 10https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [11:44:18] FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:44] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9980918 (10Clement_Goubert) a:03KFrancis @KFrancis can you please confirm NDA status? [11:49:18] FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:57:56] (03PS1) 10DCausse: team-search-platform: migrate cirrus_cluster_checks [alerts] - 10https://gerrit.wikimedia.org/r/1054317 (https://phabricator.wikimedia.org/T359033) [11:58:01] (03PS5) 10AOkoth: vrts: fix proxy for download [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) [11:59:46] (03PS2) 10DCausse: rdf-streaming-updater: add split graph config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053734 [12:00:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [12:04:39] I'm investigating those 
otelcollector alerts btw [12:05:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [12:07:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [12:12:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [12:15:38] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [12:16:01] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [12:20:36] FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:22:49] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054328 [12:24:00] (03PS1) 10Stevemunene: Upgrade airflow test instance version to v2.9.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054329 (https://phabricator.wikimedia.org/T365449) [12:24:18] FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:26:22] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: add split graph config for staging 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1053734 (owner: 10DCausse) [12:27:15] (03Merged) 10jenkins-bot: rdf-streaming-updater: add split graph config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053734 (owner: 10DCausse) [12:30:08] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [12:30:10] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [12:30:45] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [12:30:47] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [12:37:10] (03PS1) 10KartikMistry: Update cxserver to 2024-07-15-100650-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054340 (https://phabricator.wikimedia.org/T354666) [12:39:14] (03CR) 10Klausman: [C:03+2] ml: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [12:39:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [12:39:51] (03Merged) 10jenkins-bot: ml: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [12:40:12] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:41:00] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:41:32] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:41:40] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[12:43:54] (03CR) 10Vgutierrez: "looking good, just an inline question about templates" [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) (owner: 10Giuseppe Lavagetto) [12:44:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [12:51:26] (03CR) 10Vgutierrez: [C:04-1] Add public suffix list module (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall) [12:52:55] (03CR) 10Vgutierrez: ncmonitor: Set path for public suffix domain list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [12:54:18] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:54:24] (03PS14) 10Stevemunene: wdqs: add main and scholarly puppet config [puppet] - 10https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) [12:54:25] (03PS1) 10Stevemunene: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364) [12:55:49] FIRING: PuppetFailure: Puppet has failed on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:56:21] (03PS1) 10MVernon: hiera: mark apus service as in production [puppet] - 10https://gerrit.wikimedia.org/r/1054344 (https://phabricator.wikimedia.org/T279621) [12:58:23] (03PS2) 10Lucas Werkmeister (WMDE): Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) [12:58:24] (03CR) 10Vgutierrez: [C:04-1] "sorry about that, I was under the wrong impression that you took care of it" [puppet] - 10https://gerrit.wikimedia.org/r/1053937 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey) [12:59:08] (03PS3) 10Lucas Werkmeister (WMDE): Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) [12:59:18] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:59:28] (03CR) 10Lucas Werkmeister (WMDE): "(PS3 just adds a trailing comma 🙂)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) (owner: 10Lucas Werkmeister (WMDE)) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1300). [13:00:05] Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:49] o/ [13:00:53] I can deploy ^^ [13:01:35] (03CR) 10Stevemunene: wdqs: add main and scholarly puppet config (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [13:01:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) (owner: 10Lucas Werkmeister (WMDE)) [13:02:18] (03Merged) 10jenkins-bot: Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) (owner: 10Lucas Werkmeister (WMDE)) [13:02:33] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1052699|Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] (T369495)]] [13:02:37] T369495: Make `haswbstatement:` work for the EntitySchema property - https://phabricator.wikimedia.org/T369495 [13:05:58] (03PS1) 10MVernon: apus: add active/active geoip service record [dns] - 10https://gerrit.wikimedia.org/r/1054346 (https://phabricator.wikimedia.org/T279621) [13:08:20] k8s image build feels like it’s taking unusually long [13:08:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9981160 (10elukey) As FYI we already have T367970 to upgrade pxelinux to 6.04, but IIRC we already manually tested it and it didn't fix the issue (that... [13:08:38] maybe because it’s the first build this week? 
[13:09:18] ok now it’s done (took 6½ minutes all in all) [13:09:20] (03PS1) 10MVernon: hiera: use discovery hostname in apus probes [puppet] - 10https://gerrit.wikimedia.org/r/1054347 (https://phabricator.wikimedia.org/T279621) [13:11:48] docker_pull_k8s also taking much longer than usual [13:14:18] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:39] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1052699|Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] (T369495)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:15:43] T369495: Make `haswbstatement:` work for the EntitySchema property - https://phabricator.wikimedia.org/T369495 [13:15:47] alright, let’s test [13:17:26] hm, not seeing any changes so far… [13:18:55] anybody happen to know how I can force re-indexing of a page? 
[13:19:00] I already edited it but it seems to have had no effect [13:19:18] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:00] oh, I should look at logstash [13:20:36] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:45] hm, nothing there AFAICT [13:22:25] ok https://www.wikidata.org/wiki/Q4115189?action=cirrusDump just updated [13:22:27] guess it was delayed [13:22:38] P12886 is in outgoing_link now [13:22:47] but not in statement_keywords 😔 [13:24:33] although… if the search updating is delayed / async [13:24:40] then I guess it makes sense that I’m not seeing the effect of my config change yet [13:24:48] as the job runner(?) wouldn’t be using mwdebug [13:25:11] so I guess I’ll just have to roll it out, watch logstash, and be ready to roll back in case it provokes errors on the job runners [13:25:42] let’s go ahead with that then [13:25:44] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [13:27:53] Lucas_WMDE: I seem to see it (P12886=E123) using the cirrus doc build code: https://www.wikidata.org/w/api.php?action=query&cbbuilders=content|links&format=json&format=json&formatversion=2&revids=2204971309&prop=cirrusbuilddoc [13:28:08] yay, thanks! 
[13:28:12] I already forgot that existed [13:28:12] but there's some caching there that makes it hard to test as well [13:28:21] ah right [13:28:28] the cache that I added mt_rand() to the key in localhost ^^ [13:29:09] :) [13:31:00] (03CR) 10JMeybohm: [C:03+1] Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [13:33:24] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1052699|Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] (T369495)]] (duration: 30m 51s) [13:33:28] T369495: Make `haswbstatement:` work for the EntitySchema property - https://phabricator.wikimedia.org/T369495 [13:35:45] (03CR) 10Elukey: [C:03+2] Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [13:36:14] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981304 (10ssingh) [13:36:32] (03PS1) 10Clément Goubert: turnilo: Fix url shortening [puppet] - 10https://gerrit.wikimedia.org/r/1054348 (https://phabricator.wikimedia.org/T367949) [13:36:39] (03Merged) 10jenkins-bot: Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [13:37:03] (03PS2) 10Clément Goubert: turnilo: Fix url shortening [puppet] - 10https://gerrit.wikimedia.org/r/1054348 (https://phabricator.wikimedia.org/T367949) [13:37:03] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054348 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:39:27] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo flapping on July 14 - 
https://phabricator.wikimedia.org/T370048#9981315 (10ssingh) [13:39:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbproxy2005.codfw.wmnet with OS bookworm [13:39:48] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9981320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm [13:40:27] (03CR) 10Btullis: [C:03+1] "Excellent! Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1054348 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:40:40] (03CR) 10Clément Goubert: [C:03+2] turnilo: Fix url shortening [puppet] - 10https://gerrit.wikimedia.org/r/1054348 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:41:54] !log UTC afternoon backport+config window done [13:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:12] <_joe_> !log uploading conftool 3.1.0 to bookworm,bullseye,buster [13:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:35] _joe_: <3 [13:46:26] 10ops-eqiad, 06DBA, 06DC-Ops, 13Patch-For-Review: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855#9981330 (10Marostegui) >>! In T369855#9979761, @Ladsgroup wrote: > Also noting that this is a candidate master. All hosts in x1 are potential candidate masters. They all ru... [13:49:18] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:36] XioNoX: ^ known? 
[13:50:04] sukhe: yeah, it's a downtime on the not yet live netbox servers that expired [13:50:14] I'll re-downtime it [13:50:27] ah ok, 1003 [13:50:28] thanks [13:50:35] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netboxdb2003.codfw.wmnet with reason: netbox upgrade prep work [13:50:49] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on netboxdb2003.codfw.wmnet with reason: netbox upgrade prep work [13:51:04] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work [13:51:18] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work [13:53:00] (03CR) 10Ssingh: [C:03+1] apus: add active/active geoip service record [dns] - 10https://gerrit.wikimedia.org/r/1054346 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:53:36] (03CR) 10Ssingh: [C:03+1] hiera: use discovery hostname in apus probes [puppet] - 10https://gerrit.wikimedia.org/r/1054347 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:53:46] (03CR) 10Ssingh: [C:03+1] hiera: mark apus service as in production [puppet] - 10https://gerrit.wikimedia.org/r/1054344 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:53:54] !log oblivian@puppetmaster2001 conftool action : set/pooled=yes; selector: name=mw1386.*,cluster=kubernetes,dc=eqiad [reason: Test conftool sal logging] [13:54:03] <_joe_> sukhe: ^^ [13:54:06] :D [13:54:32] <_joe_> there is a problem though to install the new version on the other puppetmasters [13:54:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2005.codfw.wmnet with reason: host reimage [13:55:00] _joe_: what kind of issue? 
[13:56:09] (03CR) 10Elukey: [C:03+1] pyrra: add liftwing SLOs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [13:57:40] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdj) failed in moss-be2002 - https://phabricator.wikimedia.org/T370034#9981365 (10Jhancock.wm) The drive was blinking. thanks for that. The disk has been replaced. [13:57:44] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981361 (10ayounsi) If I was paranoid, I'd say it's possibly a bug being exploited that can cause a DDoS and we should prioritize T364092. We have a couple runbooks that could fit the sit... [13:58:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2005.codfw.wmnet with reason: host reimage [13:58:35] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981370 (10ssingh) >>! In T370048#9981361, @ayounsi wrote: > If I was paranoid, I'd say it's possibly a bug being exploited that can cause a DDoS and we should prioritize T364092. > > We... [13:59:26] <_joe_> sukhe: we have some rules in requestctl that are supposedly cache_miss_only: false [13:59:30] <_joe_> and they'd be moved out [13:59:39] <_joe_> they all seem old stuff that shouldn't be there atm [14:00:07] (03CR) 10Phuedx: [C:03+1] MediaWikiPingback is now on event platform. Use eventlogging_legacy refine job [puppet] - 10https://gerrit.wikimedia.org/r/1050008 (https://phabricator.wikimedia.org/T323828) (owner: 10Ottomata) [14:00:23] <_joe_> or, we can merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054081/ and followups [14:00:43] <_joe_> actually, I think I'll upgrade [14:00:46] noted. 
hth if you need a second pair of eyes [14:03:18] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9981375 (10VRiley-WMF) Hey @fgiunchedi sorry for the late response. I am available to work on this today. Please be aware, we will have to physically move the server in order to plug in a 10Gbit con... [14:03:27] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9981376 (10VRiley-WMF) a:03VRiley-WMF [14:04:23] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9981382 (10VRiley-WMF) @Eevans Would we be able to move forward with this today or tomorrow? Let us know, thanks! [14:06:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1156.eqiad.wmnet with reason: Maintenance [14:06:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1156.eqiad.wmnet with reason: Maintenance [14:06:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:07:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:07:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T367781)', diff saved to https://phabricator.wikimedia.org/P66483 and previous config saved to /var/cache/conftool/dbconfig/20240715-140720-arnaudb.json [14:07:38] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:09:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T367781)', diff saved to https://phabricator.wikimedia.org/P66484 and previous config saved to 
/var/cache/conftool/dbconfig/20240715-140934-arnaudb.json [14:11:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdj) failed in moss-be2002 - https://phabricator.wikimedia.org/T370034#9981395 (10Jhancock.wm) a:03Jhancock.wm [14:13:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2005.codfw.wmnet with OS bookworm [14:13:35] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9981400 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm completed: - dbproxy... [14:15:55] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: readability_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054080 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [14:15:55] (03CR) 10Aqu: [C:03+1] "Looks good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052762 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [14:16:40] <_joe_> !log updating conftool to 3.1.0 fleet wide [14:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:19] (03CR) 10Giuseppe Lavagetto: varnish: add requestctl filters for cache hits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) (owner: 10Giuseppe Lavagetto) [14:19:14] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9981439 (10Marostegui) @Papaul dbproxy2005 looks good now - no ipv6 and I can reach it just fine. If you want to move it back to 10G that's great, and if you'd want t... 
[14:24:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P66485 and previous config saved to /var/cache/conftool/dbconfig/20240715-142441-arnaudb.json [14:25:27] (03CR) 10Herron: [C:03+1] o11y: alert on benthos-webrequest-sampled lag [alerts] - 10https://gerrit.wikimedia.org/r/1054288 (owner: 10Filippo Giunchedi) [14:36:48] (03CR) 10Filippo Giunchedi: [C:03+2] o11y: alert on benthos-webrequest-sampled lag [alerts] - 10https://gerrit.wikimedia.org/r/1054288 (owner: 10Filippo Giunchedi) [14:39:18] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P66486 and previous config saved to /var/cache/conftool/dbconfig/20240715-143948-arnaudb.json [14:44:12] 07sre-alert-triage, 10SRE Observability (FY2024/2025-Q1): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#9981546 (10fgiunchedi) Thank you @LSobanski ! I'll be reaching out to the individual service owners [14:45:27] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9981563 (10Eevans) >>! In T368766#9981382, @VRiley-WMF wrote: > @Eevans Would we be able to move forward with this today or tomorrow? Let us know, thanks! Sure, that works. Let me know when! [14:45:34] (03CR) 10EoghanGaffney: "One more small comment, after that I think it's good to go." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [14:48:22] (03PS1) 10Gmodena: eventbus: enable instrumentation on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587) [14:49:59] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033 [14:50:03] T362033: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033 [14:50:13] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033 [14:50:18] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9981616 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9483e0b8-53c7-4b67-8ac7-0ee42edaeba5) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with r... 
[14:52:34] (03CR) 10AOkoth: vrts: fix proxy for download (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [14:53:51] (03PS6) 10AOkoth: vrts: fix proxy for download [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) [14:54:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T367781)', diff saved to https://phabricator.wikimedia.org/P66487 and previous config saved to /var/cache/conftool/dbconfig/20240715-145455-arnaudb.json [14:54:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:55:00] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:55:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:55:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T367781)', diff saved to https://phabricator.wikimedia.org/P66488 and previous config saved to /var/cache/conftool/dbconfig/20240715-145517-arnaudb.json [14:57:04] (03CR) 10AOkoth: vrts: fix proxy for download (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [14:57:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T367781)', diff saved to https://phabricator.wikimedia.org/P66489 and previous config saved to /var/cache/conftool/dbconfig/20240715-145728-arnaudb.json [14:58:17] (03CR) 10Kevin Bazira: [C:03+2] ml-services: readability_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054080 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [14:59:06] (03Merged) 10jenkins-bot: ml-services: readability_model from src dir [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1054080 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [14:59:18] FIRING: [2x] ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:59:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:36] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:06:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:06:26] hmm [15:07:11] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:07:50] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9981696 (10VRiley-WMF) 05Open→03Resolved I have placed the HDD's back into the original server and have booted it up. Since this ticket is specific for the SSH/Managment mismatch, I'll be closing this ticket. [15:09:57] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[15:10:36] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:11:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:12:02] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:12:35] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:12:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P66490 and previous config saved to /var/cache/conftool/dbconfig/20240715-151235-arnaudb.json [15:13:14] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[15:13:36] !log mnz@deploy1002 Started deploy [airflow-dags/research@5121748]: (no justification provided) [15:14:08] !log mnz@deploy1002 Finished deploy [airflow-dags/research@5121748]: (no justification provided) (duration: 00m 31s) [15:14:18] RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:44] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T370062 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:15:55] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T370062 (10ops-monitoring-bot) 03NEW [15:16:18] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:16:50] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:17:19] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:26:05] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9981821 (10Volans) Sorry if I'm late to the task, I discovered it just today as I was not subscribed to it. Allow me to be really sad that in this whole discu... 
[15:27:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P66491 and previous config saved to /var/cache/conftool/dbconfig/20240715-152742-arnaudb.json [15:28:47] FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlserve@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:31:35] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1530). [15:31:58] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netbox2003.codfw.wmnet with reason: netbox upgrade prep work [15:32:12] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on netbox2003.codfw.wmnet with reason: netbox upgrade prep work [15:34:49] (03PS1) 10Filippo Giunchedi: o11y: disable pint promql/series for BenthosKafkaConsumerLag + webrequest [alerts] - 10https://gerrit.wikimedia.org/r/1054363 (https://phabricator.wikimedia.org/T369737) [15:36:42] (03CR) 10Urbanecm: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049915 (https://phabricator.wikimedia.org/T356924) (owner: 10Dreamy Jazz) [15:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:42:42] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 (10ssingh) 03NEW 
[15:42:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T367781)', diff saved to https://phabricator.wikimedia.org/P66492 and previous config saved to /var/cache/conftool/dbconfig/20240715-154250-arnaudb.json [15:42:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:42:55] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:43:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:43:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T367781)', diff saved to https://phabricator.wikimedia.org/P66493 and previous config saved to /var/cache/conftool/dbconfig/20240715-154312-arnaudb.json [15:45:08] 06SRE, 06collaboration-services: gitlab2002: wrong network for pulic IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#9982100 (10LSobanski) a:03Jelto [15:45:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T367781)', diff saved to https://phabricator.wikimedia.org/P66494 and previous config saved to /var/cache/conftool/dbconfig/20240715-154526-arnaudb.json [15:46:20] 06SRE, 06collaboration-services: gitlab2002: wrong network for pulic IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#9982117 (10LSobanski) p:05Triage→03Medium [15:46:21] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 102586240 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:47:21] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 64920 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:47:29] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[15:47:35] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:53:47] RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlserve@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:57:57] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9982145 (10ssingh) >>! In T369366#9981821, @Volans wrote: > Sorry if I'm late to the task, I discovered it just today as I was not subscribed to it. > > Allow... [15:59:17] 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9982167 (10Jhancock.wm) Got the card back @fgiunchedi. I'm free to swap it anytime on Tuesday or Thursday between 8am and 4pm CDT [16:00:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P66495 and previous config saved to /var/cache/conftool/dbconfig/20240715-160033-arnaudb.json [16:02:54] 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9982213 (10fgiunchedi) >>! In T369826#9982167, @Jhancock.wm wrote: > Got the card back @fgiunchedi. I'm free to swap it anytime on Tuesday or Thursday between 8am and 4pm CDT Thank you ! I'm good w... 
[16:06:35] (03PS1) 10Effie Mouzeli: mcrouter: test bookworm image on mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366) [16:11:49] (03PS1) 10Effie Mouzeli: mcrouter: test bookworm image on mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054368 (https://phabricator.wikimedia.org/T368366) [16:14:38] (03CR) 10Herron: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [16:15:06] 06SRE, 06collaboration-services: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#9982267 (10Aklapper) [16:15:35] RECOVERY - dump of s6 in codfw on backupmon1001 is OK: Last dump for s6 at codfw (db2197) taken on 2024-07-15 14:49:19 (74 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:15:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P66496 and previous config saved to /var/cache/conftool/dbconfig/20240715-161541-arnaudb.json [16:16:45] (03CR) 10Bartosz Dziewoński: "Can it be merged and deployed for real now?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [16:18:30] (03CR) 10Elukey: "Thanks! Is it possible that the new image config is misaligned? 
I don't see it in the CI's diff :(" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [16:26:29] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:28:30] (03PS1) 10AOkoth: vrts: change root mail alias [puppet] - 10https://gerrit.wikimedia.org/r/1054369 (https://phabricator.wikimedia.org/T369445) [16:28:47] (03CR) 10EoghanGaffney: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [16:29:01] (03PS2) 10Effie Mouzeli: mcrouter: test bookworm image on mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054368 (https://phabricator.wikimedia.org/T368366) [16:29:17] (03PS1) 10Ssingh: Release 0.9.8-1+wmf12u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068) [16:30:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T367781)', diff saved to https://phabricator.wikimedia.org/P66497 and previous config saved to /var/cache/conftool/dbconfig/20240715-163048-arnaudb.json [16:30:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:30:52] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [16:31:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:31:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T367781)', diff saved to https://phabricator.wikimedia.org/P66498 and previous config saved to /var/cache/conftool/dbconfig/20240715-163110-arnaudb.json [16:31:29] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is 
CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:33:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T367781)', diff saved to https://phabricator.wikimedia.org/P66499 and previous config saved to /var/cache/conftool/dbconfig/20240715-163320-arnaudb.json [16:36:16] (03PS1) 10Arlolra: Revert changes in log levels [extensions/Linter] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054371 [16:38:09] (03CR) 10Dzahn: [C:03+1] "good idea, let's try it. make sure to send some test mail though" [puppet] - 10https://gerrit.wikimedia.org/r/1054369 (https://phabricator.wikimedia.org/T369445) (owner: 10AOkoth) [16:38:45] (03CR) 10Effie Mouzeli: "I think it is just the CI, I will get back to you as soon as know for sure" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [16:43:32] (03CR) 10AOkoth: "Yeah, I can try that after merging this." [puppet] - 10https://gerrit.wikimedia.org/r/1054369 (https://phabricator.wikimedia.org/T369445) (owner: 10AOkoth) [16:44:34] (03PS1) 10Dzahn: remove git.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1054372 (https://phabricator.wikimedia.org/T323073) [16:45:30] (03CR) 10Dzahn: "I will start with DNS first since it's trivial to revert just in case. After a little waiting period then coming back to this." [puppet] - 10https://gerrit.wikimedia.org/r/1006982 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [16:45:40] (03CR) 10Dzahn: "I will start with DNS first since it's trivial to revert just in case. After a little waiting period then coming back to this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1006979 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [16:47:29] (03CR) 10AOkoth: [C:03+2] vrts: change root mail alias [puppet] - 10https://gerrit.wikimedia.org/r/1054369 (https://phabricator.wikimedia.org/T369445) (owner: 10AOkoth) [16:48:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P66500 and previous config saved to /var/cache/conftool/dbconfig/20240715-164827-arnaudb.json [16:48:50] (03PS2) 10DCausse: team-search-platform: migrate cirrus_cluster_checks [alerts] - 10https://gerrit.wikimedia.org/r/1054317 (https://phabricator.wikimedia.org/T359033) [16:48:50] (03PS1) 10DCausse: team-search-platform: migrate cirrus latencies & mem alert [alerts] - 10https://gerrit.wikimedia.org/r/1054374 (https://phabricator.wikimedia.org/T359033) [16:50:52] (03CR) 10CI reject: [V:04-1] team-search-platform: migrate cirrus latencies & mem alert [alerts] - 10https://gerrit.wikimedia.org/r/1054374 (https://phabricator.wikimedia.org/T359033) (owner: 10DCausse) [16:51:15] 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9982422 (10Jhancock.wm) We won't need to move racks. But because of the way the switches are, we can't reuse the same port on the switch. we'll be moving to a different set of 4. Are you going to re... [16:55:25] (03PS2) 10DCausse: team-search-platform: migrate cirrus latencies & mem alert [alerts] - 10https://gerrit.wikimedia.org/r/1054374 (https://phabricator.wikimedia.org/T359033) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1700) [17:00:05] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1700). 
[17:03:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P66501 and previous config saved to /var/cache/conftool/dbconfig/20240715-170334-arnaudb.json [17:06:29] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:08:31] (03Abandoned) 10Urbanecm: lists::automation: Update stewards-l in real mode [puppet] - 10https://gerrit.wikimedia.org/r/1052188 (https://phabricator.wikimedia.org/T351202) (owner: 10Urbanecm) [17:08:45] (03PS1) 10Papaul: Add frand200[1-2] to DNS files [dns] - 10https://gerrit.wikimedia.org/r/1054377 [17:12:56] (03PS2) 10Scott French: kserve-inference: update references to deprecated services in fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053809 (https://phabricator.wikimedia.org/T367949) [17:14:11] (03PS2) 10Scott French: mobileapps: update references to deprecated services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053806 (https://phabricator.wikimedia.org/T367949) [17:14:11] (03PS2) 10Scott French: push-notifications: update references to deprecated services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053807 (https://phabricator.wikimedia.org/T367949) [17:14:11] (03PS2) 10Scott French: wikifeeds: update references to deprecated services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053808 (https://phabricator.wikimedia.org/T367949) [17:17:21] (03CR) 10Dwisehaupt: [C:03+1] "Those hostnames and IPs look good and in the correct ranges. Shipit." 
[dns] - 10https://gerrit.wikimedia.org/r/1054377 (owner: 10Papaul) [17:17:50] (03CR) 10Scott French: [C:03+2] kserve-inference: update references to deprecated services in fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053809 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [17:18:06] (03CR) 10Papaul: [C:03+2] Add frand200[1-2] to DNS files [dns] - 10https://gerrit.wikimedia.org/r/1054377 (owner: 10Papaul) [17:18:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T367781)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240715-171841-arnaudb.json [17:18:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:18:52] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:19:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:19:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T367781)', diff saved to https://phabricator.wikimedia.org/P66503 and previous config saved to /var/cache/conftool/dbconfig/20240715-171908-arnaudb.json [17:19:36] (03CR) 10Scott French: "Alas, forgot to bump the chart version in this one before (done)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053808 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [17:19:38] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9982548 (10Volans) Thanks for the clarification. I didn't meant to imply that you didn't want a cookbook as end goal (although it was not mentioned). >>! In T... 
[17:19:56] (03Merged) 10jenkins-bot: kserve-inference: update references to deprecated services in fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053809 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [17:21:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T367781)', diff saved to https://phabricator.wikimedia.org/P66504 and previous config saved to /var/cache/conftool/dbconfig/20240715-172118-arnaudb.json [17:23:21] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#9982576 (10Papaul) [17:36:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P66505 and previous config saved to /var/cache/conftool/dbconfig/20240715-173625-arnaudb.json [17:38:09] (03PS2) 10Ssingh: Release 0.9.8-1+wmf12u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068) [17:40:55] !log mnz@deploy1002 Started deploy [airflow-dags/research@5121748]: (no justification provided) [17:41:06] !log mnz@deploy1002 Finished deploy [airflow-dags/research@5121748]: (no justification provided) (duration: 00m 10s) [17:41:27] (03CR) 10Ssingh: "I think this is low to medium priority but ready for review. OK build on build2001:" [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh) [17:42:54] (03CR) 10Ssingh: "The bullseye packages are not updated because the hosts are on bullseye so there is no need for us to follow suit with 0.9.8 there." 
[debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh) [17:51:08] (03PS1) 10Arlolra: Revert "Change Linter log level to info" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054382 [17:51:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P66506 and previous config saved to /var/cache/conftool/dbconfig/20240715-175133-arnaudb.json [17:55:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/Linter] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054371 (owner: 10Arlolra) [17:55:49] (03CR) 10Jgiannelos: [C:03+1] Revert "Change Linter log level to info" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054382 (owner: 10Arlolra) [17:56:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054382 (owner: 10Arlolra) [17:56:04] (03CR) 10Jgiannelos: [C:03+1] Revert changes in log levels [extensions/Linter] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054371 (owner: 10Arlolra) [17:58:28] (03CR) 10Dzahn: [V:03+1 C:03+2] mailman3: defined type to sync list members, create timers for each list [puppet] - 10https://gerrit.wikimedia.org/r/1053399 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [18:01:01] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [18:04:15] !log upgraded prometheus-ipmi-exporter to 1.8.0 T368088 [18:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:29] T368088: upgrade 
prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088 [18:04:47] (03PS3) 10Herron: prom-ipmi-exporter: add sel-events collector [puppet] - 10https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088) [18:06:04] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9982911 (10ssingh) >>! In T369366#9982548, @Volans wrote: > Thanks for the clarification. I didn't meant to imply that you didn't want a cookbook as end goal (... [18:06:24] (03CR) 10Ebernhardson: [C:03+1] "Looks reasonable, PCC reports mostly what i expect. It suspiciously claims a bunch of lines added and non removed in /etc/wdqs/allowlist-w" [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [18:06:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T367781)', diff saved to https://phabricator.wikimedia.org/P66507 and previous config saved to /var/cache/conftool/dbconfig/20240715-180640-arnaudb.json [18:06:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:06:53] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:06:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:07:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1229.eqiad.wmnet with reason: Maintenance [18:07:11] (03PS4) 10Ryan Kemper: wdqs restart envoy: support graph split aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1053778 (https://phabricator.wikimedia.org/T364077) [18:07:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1229.eqiad.wmnet with reason: Maintenance [18:07:27] !log arnaudb@cumin1002 
dbctl commit (dc=all): 'Depooling db1229 (T367781)', diff saved to https://phabricator.wikimedia.org/P66508 and previous config saved to /var/cache/conftool/dbconfig/20240715-180726-arnaudb.json [18:09:29] (03CR) 10Herron: [C:03+2] prom-ipmi-exporter: add sel-events collector [puppet] - 10https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088) (owner: 10Herron) [18:09:36] (03PS4) 10BCornwall: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 [18:09:36] (03PS3) 10BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) [18:09:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T367781)', diff saved to https://phabricator.wikimedia.org/P66509 and previous config saved to /var/cache/conftool/dbconfig/20240715-180937-arnaudb.json [18:10:14] (03CR) 10BCornwall: Add public suffix list module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall) [18:10:41] (03PS8) 10Jdlrobson: [July 15th] Deploy dark mode to all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795) [18:10:50] (03CR) 10BCornwall: Add public suffix list module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall) [18:11:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795) (owner: 10Jdlrobson) [18:11:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 
(https://phabricator.wikimedia.org/T368795) (owner: 10Jdlrobson) [18:11:41] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3232/console" [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall) [18:13:11] 10SRE-Access-Requests, 10LDAP-Access-Requests: LDAP access to the analytics-privatedata-users group for Quiddity - https://phabricator.wikimedia.org/T370091 (10Quiddity) 03NEW [18:13:29] (03CR) 10BCornwall: ncmonitor: Set path for public suffix domain list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [18:15:32] (03PS4) 10BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) [18:16:27] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3234/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [18:22:16] (03CR) 10Ryan Kemper: [C:03+2] wdqs restart envoy: support graph split aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1053778 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [18:24:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66510 and previous config saved to /var/cache/conftool/dbconfig/20240715-182426-root.json [18:24:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66511 and previous config saved to /var/cache/conftool/dbconfig/20240715-182436-root.json [18:24:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to 
https://phabricator.wikimedia.org/P66512 and previous config saved to /var/cache/conftool/dbconfig/20240715-182444-arnaudb.json [18:25:38] (03PS1) 10Marostegui: Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054385 [18:25:54] (03PS1) 10Marostegui: Revert "db2121: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054386 [18:26:40] (03CR) 10Marostegui: [C:03+2] Revert "db2121: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054386 (owner: 10Marostegui) [18:26:48] (03CR) 10Marostegui: [C:03+2] Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054385 (owner: 10Marostegui) [18:39:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P66513 and previous config saved to /var/cache/conftool/dbconfig/20240715-183931-root.json [18:39:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P66514 and previous config saved to /var/cache/conftool/dbconfig/20240715-183942-root.json [18:39:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P66515 and previous config saved to /var/cache/conftool/dbconfig/20240715-183952-arnaudb.json [18:42:42] (03PS1) 10Dzahn: mailman3: add missing whitespace in sync_list_members [puppet] - 10https://gerrit.wikimedia.org/r/1054388 (https://phabricator.wikimedia.org/T351202) [18:45:13] (03CR) 10Ssingh: Add public suffix list module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall) [18:48:35] (03CR) 10Ssingh: Add public suffix list module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall) [18:54:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P66516 and previous config saved to /var/cache/conftool/dbconfig/20240715-185437-root.json [18:54:45] (03CR) 10Dzahn: [C:03+2] mailman3: add missing whitespace in sync_list_members [puppet] - 10https://gerrit.wikimedia.org/r/1054388 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [18:54:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66517 and previous config saved to /var/cache/conftool/dbconfig/20240715-185447-root.json [18:55:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T367781)', diff saved to https://phabricator.wikimedia.org/P66518 and previous config saved to /var/cache/conftool/dbconfig/20240715-185459-arnaudb.json [18:55:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1233.eqiad.wmnet with reason: Maintenance [18:55:03] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:55:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1233.eqiad.wmnet with reason: Maintenance [18:55:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T367781)', diff saved to https://phabricator.wikimedia.org/P66519 and previous config saved to /var/cache/conftool/dbconfig/20240715-185521-arnaudb.json [18:57:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T367781)', diff saved to https://phabricator.wikimedia.org/P66520 and previous config saved to /var/cache/conftool/dbconfig/20240715-185736-arnaudb.json [18:59:18] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:09:43] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66521 and previous config saved to /var/cache/conftool/dbconfig/20240715-190942-root.json [19:09:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66522 and previous config saved to /var/cache/conftool/dbconfig/20240715-190953-root.json [19:12:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P66523 and previous config saved to /var/cache/conftool/dbconfig/20240715-191243-arnaudb.json [19:16:23] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.144`. Pre-deploy tests passing on canary `wdqs1016` [19:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:36] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@9ad2bec]: 0.3.144 [19:17:04] !log [WDQS Deploy] Tests passing following deploy of `0.3.144` on canary `wdqs1016`; proceeding to rest of fleet [19:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:51] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1098-1099].eqiad.wmnet with reason: T348977 [19:23:54] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 [19:24:07] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1098-1099].eqiad.wmnet with reason: T348977 [19:24:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66524 and previous config saved to /var/cache/conftool/dbconfig/20240715-192448-root.json [19:24:52] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic109[8-9]* for T348977 - bking@cumin2002 [19:24:55] !log 
bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic109[8-9]* for T348977 - bking@cumin2002 [19:24:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66525 and previous config saved to /var/cache/conftool/dbconfig/20240715-192458-root.json [19:25:07] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@9ad2bec]: 0.3.144 (duration: 08m 31s) [19:27:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P66526 and previous config saved to /var/cache/conftool/dbconfig/20240715-192750-arnaudb.json [19:28:22] (03PS8) 10Ryan Kemper: wdqs: enable throttling only for reqs from the CDN [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [19:29:01] (03CR) 10CI reject: [V:04-1] wdqs: enable throttling only for reqs from the CDN [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [19:34:46] (03CR) 10Ryan Kemper: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [19:36:39] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [19:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:39:00] (03CR) 10Krinkle: MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: 10Ahmon Dancy) [19:39:29] RECOVERY - 
Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29691 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [19:39:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66527 and previous config saved to /var/cache/conftool/dbconfig/20240715-193953-root.json [19:40:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66528 and previous config saved to /var/cache/conftool/dbconfig/20240715-194004-root.json [19:42:17] (03PS9) 10Ryan Kemper: wdqs: enable throttling only for reqs from the CDN [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [19:42:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T367781)', diff saved to https://phabricator.wikimedia.org/P66529 and previous config saved to /var/cache/conftool/dbconfig/20240715-194257-arnaudb.json [19:42:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [19:43:02] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [19:43:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [19:43:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1246.eqiad.wmnet with reason: Maintenance [19:43:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1246.eqiad.wmnet with reason: Maintenance [19:43:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T367781)', diff saved to https://phabricator.wikimedia.org/P66530 and previous config saved to 
/var/cache/conftool/dbconfig/20240715-194344-arnaudb.json [19:46:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T367781)', diff saved to https://phabricator.wikimedia.org/P66531 and previous config saved to /var/cache/conftool/dbconfig/20240715-194559-arnaudb.json [19:47:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T367856)', diff saved to https://phabricator.wikimedia.org/P66532 and previous config saved to /var/cache/conftool/dbconfig/20240715-194711-marostegui.json [19:47:16] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [19:51:11] (03CR) 10Ryan Kemper: "WDQS deployed. We'll try merging this and then seeing with tcpdump on wdqs1023 if the appropriate header is set when we do a test federate" [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [19:51:18] (03CR) 10Ryan Kemper: [C:03+2] wdqs: enable throttling only for reqs from the CDN [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [19:52:07] (03CR) 10Ryan Kemper: [C:03+2] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [19:55:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66533 and previous config saved to /var/cache/conftool/dbconfig/20240715-195459-root.json [19:55:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66534 and previous config saved to /var/cache/conftool/dbconfig/20240715-195510-root.json [19:57:30] (03PS1) 10Ryan Kemper: Revert "wdqs: enable throttling only for reqs from the CDN" [puppet] - 10https://gerrit.wikimedia.org/r/1054392 [19:59:39] 
(03PS1) 10Ryan Kemper: wdqs: map lines missing trailing ; [puppet] - 10https://gerrit.wikimedia.org/r/1054393 (https://phabricator.wikimedia.org/T361950) [19:59:54] (03CR) 10Ryan Kemper: "This revert may not be necessary if https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054393 works" [puppet] - 10https://gerrit.wikimedia.org/r/1054392 (owner: 10Ryan Kemper) [19:59:58] (03CR) 10CI reject: [V:04-1] Revert "wdqs: enable throttling only for reqs from the CDN" [puppet] - 10https://gerrit.wikimedia.org/r/1054392 (owner: 10Ryan Kemper) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T2000). [20:00:04] arlolra and jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:15] (03CR) 10Ryan Kemper: [C:03+2] wdqs: map lines missing trailing ; [puppet] - 10https://gerrit.wikimedia.org/r/1054393 (https://phabricator.wikimedia.org/T361950) (owner: 10Ryan Kemper) [20:01:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P66535 and previous config saved to /var/cache/conftool/dbconfig/20240715-200106-arnaudb.json [20:02:16] here [20:02:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P66536 and previous config saved to /var/cache/conftool/dbconfig/20240715-200218-marostegui.json [20:07:18] (03PS1) 10CDanis: otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) [20:11:54] (03PS2) 10CDanis: otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) [20:15:13] urandom: cjming TheresNoTime RoanKattouw are either of you around to help with a deploy? [20:16:05] Yeah I can deploy [20:16:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P66537 and previous config saved to /var/cache/conftool/dbconfig/20240715-201613-arnaudb.json [20:17:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P66538 and previous config saved to /var/cache/conftool/dbconfig/20240715-201726-marostegui.json [20:17:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795) (owner: 10Jdlrobson) [20:18:14] arlolra: Are you here for your deployment? 
[20:18:30] yeah, sorry, I'm just reverting the changes we made last week [20:18:36] thanks RoanKattouw [20:18:40] (03Merged) 10jenkins-bot: [July 15th] Deploy dark mode to all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795) (owner: 10Jdlrobson) [20:18:58] !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1050082|[July 15th] Deploy dark mode to all logged-in users (T368795)]] [20:19:04] T368795: Deploy dark mode to all logged in users on Vector 2022 - https://phabricator.wikimedia.org/T368795 [20:19:15] !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good [20:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:18] (03PS3) 10CDanis: otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855) [20:19:44] (03CR) 10Catrope: [C:03+2] Revert changes in log levels [extensions/Linter] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054371 (owner: 10Arlolra) [20:22:07] !log catrope@deploy1002 jdlrobson, catrope: Backport for [[gerrit:1050082|[July 15th] Deploy dark mode to all logged-in users (T368795)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:22:19] Jdlrobson: Please test on the test servers [20:22:34] (03Merged) 10jenkins-bot: Revert changes in log levels [extensions/Linter] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054371 (owner: 10Arlolra) [20:22:36] RoanKattouw: on it [20:24:24] RoanKattouw: lgtm! Please sync! 
[20:24:27] !log catrope@deploy1002 jdlrobson, catrope: Continuing with sync
[20:29:24] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1050082|[July 15th] Deploy dark mode to all logged-in users (T368795)]] (duration: 10m 26s)
[20:29:28] T368795: Deploy dark mode to all logged in users on Vector 2022 - https://phabricator.wikimedia.org/T368795
[20:30:39] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 18.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:30:55] (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054382 (owner: Arlolra)
[20:31:08] (PS2) Arlolra: Revert "Change Linter log level to info" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054382
[20:31:14] (CR) TrainBranchBot: "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054382 (owner: Arlolra)
[20:31:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T367781)', diff saved to https://phabricator.wikimedia.org/P66539 and previous config saved to /var/cache/conftool/dbconfig/20240715-203120-arnaudb.json
[20:31:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[20:31:25] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[20:31:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[20:31:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2125.codfw.wmnet with reason: Maintenance
[20:31:53] (Merged) jenkins-bot: Revert "Change Linter log level to info" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054382 (owner: Arlolra)
[20:31:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2125.codfw.wmnet with reason: Maintenance
[20:32:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T367781)', diff saved to https://phabricator.wikimedia.org/P66540 and previous config saved to /var/cache/conftool/dbconfig/20240715-203203-arnaudb.json
[20:32:10] !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1054371|Revert changes in log levels]], [[gerrit:1054382|Revert "Change Linter log level to info"]]
[20:32:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T367856)', diff saved to https://phabricator.wikimedia.org/P66541 and previous config saved to /var/cache/conftool/dbconfig/20240715-203233-marostegui.json
[20:32:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[20:32:38] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[20:32:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[20:32:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[20:33:11] SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests: LDAP access to the analytics-privatedata-users group for Quiddity - https://phabricator.wikimedia.org/T370091#9983336 (MNeisler)
[20:34:32] !log catrope@deploy1002 arlolra, catrope: Backport for [[gerrit:1054371|Revert changes in log levels]], [[gerrit:1054382|Revert "Change Linter log level to info"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:34:37] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#9983340 (Jhancock.wm) @Papaul these servers have been cabled, bios updated, and pwd set. pending idrac IPs. franio2001 eth0 <-> FASW-C8A eth-0/0/18 eth1 <-> FASW-C8B...
[20:34:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T367781)', diff saved to https://phabricator.wikimedia.org/P66542 and previous config saved to /var/cache/conftool/dbconfig/20240715-203435-arnaudb.json
[20:34:46] arlolra: Can this be tested meaningfully or should I just continue to sync?
[20:35:02] Just continue the sync, thanks
[20:35:06] !log catrope@deploy1002 arlolra, catrope: Continuing with sync
[20:35:15] Thanks RoanKattouw looks like sync was successful!
[20:37:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[20:39:51] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1054371|Revert changes in log levels]], [[gerrit:1054382|Revert "Change Linter log level to info"]] (duration: 07m 41s)
[20:40:27] And that's it, all done
[20:41:23] Thanks for your time RoanKattouw
[20:44:39] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:44:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[20:46:31] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29705 bytes in 1.310 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:49:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[20:49:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P66543 and previous config saved to /var/cache/conftool/dbconfig/20240715-204944-arnaudb.json
[21:00:04] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T2100)
[21:02:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:04:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P66544 and previous config saved to /var/cache/conftool/dbconfig/20240715-210451-arnaudb.json
[21:12:10] (PS2) Dzahn: remove git.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1054372 (https://phabricator.wikimedia.org/T323073)
[21:13:59] (CR) EoghanGaffney: [C:+1] "Approved. This is the approach the team agreed on" [dns] - https://gerrit.wikimedia.org/r/1054372 (https://phabricator.wikimedia.org/T323073) (owner: Dzahn)
[21:15:44] (CR) Dzahn: [C:+2] remove git.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1054372 (https://phabricator.wikimedia.org/T323073) (owner: Dzahn)
[21:19:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T367781)', diff saved to https://phabricator.wikimedia.org/P66545 and previous config saved to /var/cache/conftool/dbconfig/20240715-211957-arnaudb.json
[21:20:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2126.codfw.wmnet with reason: Maintenance
[21:20:04] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[21:20:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2126.codfw.wmnet with reason: Maintenance
[21:20:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:20:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:20:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T367781)', diff saved to https://phabricator.wikimedia.org/P66546 and previous config saved to /var/cache/conftool/dbconfig/20240715-212034-arnaudb.json
[21:20:42] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1052762 (https://phabricator.wikimedia.org/T367134) (owner: TChin)
[21:23:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T367781)', diff saved to https://phabricator.wikimedia.org/P66547 and previous config saved to /var/cache/conftool/dbconfig/20240715-212302-arnaudb.json
[21:34:56] SRE, SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9983460 (KFrancis) Hello @JJMC89, please send you full name and postal address to kfrancis@wikimedia.org and I'll get the NDA processed. Thanks!
[21:38:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P66548 and previous config saved to /var/cache/conftool/dbconfig/20240715-213809-arnaudb.json
[21:53:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P66549 and previous config saved to /var/cache/conftool/dbconfig/20240715-215316-arnaudb.json
[22:06:39] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 331.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:08:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T367781)', diff saved to https://phabricator.wikimedia.org/P66550 and previous config saved to /var/cache/conftool/dbconfig/20240715-220823-arnaudb.json
[22:08:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2138.codfw.wmnet with reason: Maintenance
[22:08:34] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[22:08:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2138.codfw.wmnet with reason: Maintenance
[22:08:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T367781)', diff saved to https://phabricator.wikimedia.org/P66551 and previous config saved to /var/cache/conftool/dbconfig/20240715-220845-arnaudb.json
[22:08:50] (PS5) BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114)
[22:11:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T367781)', diff saved to https://phabricator.wikimedia.org/P66552 and previous config saved to /var/cache/conftool/dbconfig/20240715-221117-arnaudb.json
[22:11:24] (PS5) BCornwall: Add public suffix list module [puppet] - https://gerrit.wikimedia.org/r/1054069
[22:11:24] (PS6) BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114)
[22:12:24] (CR) BCornwall: Add public suffix list module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1054069 (owner: BCornwall)
[22:15:06] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3235/co" [puppet] - https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) (owner: BCornwall)
[22:26:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P66553 and previous config saved to /var/cache/conftool/dbconfig/20240715-222624-arnaudb.json
[22:30:39] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:36:40] ops-codfw, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install vrts2002 - https://phabricator.wikimedia.org/T369672#9983561 (RobH)
[22:41:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P66554 and previous config saved to /var/cache/conftool/dbconfig/20240715-224131-arnaudb.json
[22:50:53] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T370062#9983576 (Jclark-ctr) Open→Resolved a:Jclark-ctr duplicate for T362033
[22:53:29] (PS1) Dzahn: gerrit: switch firewall provider to nftables at role level [puppet] - https://gerrit.wikimedia.org/r/1054398
[22:56:06] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9983589 (Jclark-ctr) Open→Resolved
[22:56:16] SRE, collaboration-services, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists, Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9983583 (JJMC89) https://gitlab.wi...
[22:56:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T367781)', diff saved to https://phabricator.wikimedia.org/P66555 and previous config saved to /var/cache/conftool/dbconfig/20240715-225639-arnaudb.json
[22:56:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2148.codfw.wmnet with reason: Maintenance
[22:56:43] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[22:56:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2148.codfw.wmnet with reason: Maintenance
[22:57:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T367781)', diff saved to https://phabricator.wikimedia.org/P66556 and previous config saved to /var/cache/conftool/dbconfig/20240715-225701-arnaudb.json
[22:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:59:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T367781)', diff saved to https://phabricator.wikimedia.org/P66557 and previous config saved to /var/cache/conftool/dbconfig/20240715-225933-arnaudb.json
[22:59:35] ops-eqiad, DC-Ops, observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111 (RobH) NEW
[23:00:09] ops-eqiad, DC-Ops, observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#9983621 (RobH)
[23:05:08] ops-codfw, DC-Ops, observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112 (RobH) NEW
[23:05:29] ops-codfw, DC-Ops, observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#9983645 (RobH)
[23:06:06] ops-eqiad, DC-Ops, observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#9983648 (andrea.denisse) a:andrea.denisse
[23:11:30] !log nshahquinn-wmf@deploy1002 Started deploy [airflow-dags/analytics_product@767d7ad]: (no justification provided)
[23:11:39] !log nshahquinn-wmf@deploy1002 Finished deploy [airflow-dags/analytics_product@767d7ad]: (no justification provided) (duration: 00m 08s)
[23:12:59] ops-eqiad, SRE, Data-Engineering, DC-Ops: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T369829#9983662 (Jclark-ctr) You have successfully submitted request SR194058934.
[23:14:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P66558 and previous config saved to /var/cache/conftool/dbconfig/20240715-231440-arnaudb.json
[23:16:57] SRE, SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9983669 (JJMC89) >>! In T369314#9983460, @KFrancis wrote: > Hello @JJMC89, please send you full name and postal address to kfrancis@wikimedia.org and I'll get the NDA processed. Thanks!...
[23:29:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P66559 and previous config saved to /var/cache/conftool/dbconfig/20240715-232947-arnaudb.json
[23:37:13] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:38:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1054400
[23:38:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1054400 (owner: TrainBranchBot)
[23:38:25] (PS1) Zabe: Further configurations for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054401 (https://phabricator.wikimedia.org/T362529)
[23:39:13] (CR) CI reject: [V:-1] Further configurations for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054401 (https://phabricator.wikimedia.org/T362529) (owner: Zabe)
[23:39:47] (PS2) Zabe: Further configurations for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054401 (https://phabricator.wikimedia.org/T362529)
[23:39:58] jouncebot: nowandnext
[23:39:58] No deployments scheduled for the next 2 hour(s) and 20 minute(s)
[23:39:58] In 2 hour(s) and 20 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0200)
[23:41:05] (CR) Zabe: [C:+2] Further configurations for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054401 (https://phabricator.wikimedia.org/T362529) (owner: Zabe)
[23:41:44] (Merged) jenkins-bot: Further configurations for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054401 (https://phabricator.wikimedia.org/T362529) (owner: Zabe)
[23:42:10] !log zabe@deploy1002 Started scap sync-world: Backport for [[gerrit:1054401|Further configurations for aewikimedia (T362529)]]
[23:42:14] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529
[23:44:39] !log zabe@deploy1002 zabe: Backport for [[gerrit:1054401|Further configurations for aewikimedia (T362529)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:44:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T367781)', diff saved to https://phabricator.wikimedia.org/P66560 and previous config saved to /var/cache/conftool/dbconfig/20240715-234454-arnaudb.json
[23:44:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2175.codfw.wmnet with reason: Maintenance
[23:44:58] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[23:45:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2175.codfw.wmnet with reason: Maintenance
[23:45:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T367781)', diff saved to https://phabricator.wikimedia.org/P66561 and previous config saved to /var/cache/conftool/dbconfig/20240715-234516-arnaudb.json
[23:47:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T367781)', diff saved to https://phabricator.wikimedia.org/P66562 and previous config saved to /var/cache/conftool/dbconfig/20240715-234748-arnaudb.json
[23:48:58] !log zabe@mwmaint1002:~$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php aewikimedia translate # T362529
[23:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:01] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529
[23:49:11] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[23:49:28] !log zabe@deploy1002 zabe: Continuing with sync
[23:54:37] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1054401|Further configurations for aewikimedia (T362529)]] (duration: 12m 26s)
[23:54:41] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529
[23:56:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:56:55] (CR) RLazarus: switchdc: prepare mediawiki cache warmup for bare-metal turndown (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1053823 (https://phabricator.wikimedia.org/T367949) (owner: Scott French)