[00:01:08] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1054071 (owner: TrainBranchBot)
[00:14:50] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:15:44] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29699 bytes in 3.261 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:45:12] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 216036000 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:12] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7136 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:31] <wikibugs>	 (PS3) Dbrant: Enable account vanishing in CentralAuth (labs). [mediawiki-config] - https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141)
[01:20:26] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 332.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:26:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T367856)', diff saved to https://phabricator.wikimedia.org/P66467 and previous config saved to /var/cache/conftool/dbconfig/20240715-012559-marostegui.json
[01:26:04] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[01:34:20] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 391.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:41:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P66469 and previous config saved to /var/cache/conftool/dbconfig/20240715-014106-marostegui.json
[01:53:50] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:55:20] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 47.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:56:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P66470 and previous config saved to /var/cache/conftool/dbconfig/20240715-015613-marostegui.json
[01:56:44] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29698 bytes in 2.501 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[02:01:26] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:11:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T367856)', diff saved to https://phabricator.wikimedia.org/P66471 and previous config saved to /var/cache/conftool/dbconfig/20240715-021121-marostegui.json
[02:11:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[02:11:25] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[02:11:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[02:15:26] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 51.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:39:18] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[02:51:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[02:59:18] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:18:48] <jinxer-wm>	 FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:23:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[03:28:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[03:29:26] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 430.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:37:13] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:40:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[03:41:26] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 57.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:45:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[04:09:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[04:12:21] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove IPV6 for dbproxy200[5-8] - pt1979@cumin2002"
[04:13:44] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove IPV6 for dbproxy200[5-8] - pt1979@cumin2002"
[04:13:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:18:26] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 394.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:34:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:39:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:47:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[04:47:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[04:47:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T367856)', diff saved to https://phabricator.wikimedia.org/P66472 and previous config saved to /var/cache/conftool/dbconfig/20240715-044723-marostegui.json
[04:47:28] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:05:16] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1223 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1054076 (https://phabricator.wikimedia.org/T370019)
[05:05:20] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1054077 (https://phabricator.wikimedia.org/T370019)
[05:12:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host dbproxy2005.codfw.wmnet with OS bookworm
[05:12:47] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9980121 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host dbproxy2005.codfw.wmnet with OS bookworm
[05:23:14] <wikibugs>	 (PS1) Marostegui: an-redacteddb1001.yaml: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1054078
[05:23:49] <wikibugs>	 (CR) Marostegui: [C:+2] an-redacteddb1001.yaml: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1054078 (owner: Marostegui)
[05:25:09] <wikibugs>	 (CR) NMW03: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[05:25:25] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9980122 (Marostegui) @Papaul the interface wasn't in netbox anymore, but the DNS entry for that host is still gone.  I've tried to reimage the host but it gets stuck on the...
[05:27:00] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9980123 (Marostegui) Just talked to @papaul - the reimage was expected to fail since the iface was moved back to the 10G one.
[05:39:08] <wikibugs>	 (PS1) Kevin Bazira: ml-services: readability_model from src dir [deployment-charts] - https://gerrit.wikimedia.org/r/1054080 (https://phabricator.wikimedia.org/T369344)
[05:43:04] <wikibugs>	 (PS1) Giuseppe Lavagetto: varnish: add requestctl filters for cache hits [puppet] - https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794)
[05:43:06] <wikibugs>	 (PS1) Giuseppe Lavagetto: varnish: add support for hit rules [puppet] - https://gerrit.wikimedia.org/r/1054082 (https://phabricator.wikimedia.org/T369480)
[05:43:08] <wikibugs>	 (PS1) Giuseppe Lavagetto: varnish: actually include the requestctl hit rules [puppet] - https://gerrit.wikimedia.org/r/1054083 (https://phabricator.wikimedia.org/T369480)
[05:46:11] <wikibugs>	 (CR) CI reject: [V:-1] varnish: add requestctl filters for cache hits [puppet] - https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) (owner: Giuseppe Lavagetto)
[05:46:16] <wikibugs>	 (CR) CI reject: [V:-1] varnish: add support for hit rules [puppet] - https://gerrit.wikimedia.org/r/1054082 (https://phabricator.wikimedia.org/T369480) (owner: Giuseppe Lavagetto)
[05:53:31] <wikibugs>	 (PS1) NMW03: Add Portal namespace for Ingush Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089)
[05:54:08] <wikibugs>	 (CR) CI reject: [V:-1] Add Portal namespace for Ingush Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054084 (https://phabricator.wikimedia.org/T326089) (owner: NMW03)
[06:01:41] <wikibugs>	 (PS1) DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1054086 (https://phabricator.wikimedia.org/T349774)
[06:03:51] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1054086 (https://phabricator.wikimedia.org/T349774) (owner: DDesouza)
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:11] <wikibugs>	 (Merged) jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1054086 (https://phabricator.wikimedia.org/T349774) (owner: DDesouza)
[06:06:26] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 9.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:06:27] <logmsgbot>	 !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[06:06:48] <logmsgbot>	 !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[06:06:49] <logmsgbot>	 !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[06:07:23] <logmsgbot>	 !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[06:07:24] <logmsgbot>	 !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[06:07:51] <logmsgbot>	 !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:22:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool db2136', diff saved to https://phabricator.wikimedia.org/P66473 and previous config saved to /var/cache/conftool/dbconfig/20240715-062216-root.json
[06:22:26] <wikibugs>	 (CR) Arnaudb: [C:+1] db1179: Disable notification for db1179 [puppet] - https://gerrit.wikimedia.org/r/1054055 (https://phabricator.wikimedia.org/T369855) (owner: Ladsgroup)
[06:23:11] <wikibugs>	 (CR) Marostegui: "This will only work once the host is back up (so puppet runs), meanwhile I'd suggest to extend the downtime" [puppet] - https://gerrit.wikimedia.org/r/1054055 (https://phabricator.wikimedia.org/T369855) (owner: Ladsgroup)
[06:25:58] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:26:50] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29689 bytes in 1.384 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:30:31] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3222/console" [puppet] - https://gerrit.wikimedia.org/r/1049761 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[06:31:47] <wikibugs>	 (CR) Slyngshede: [V:+1 C:+2] R:idp New CAS 7 hosts. [puppet] - https://gerrit.wikimedia.org/r/1049761 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[06:48:28] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9980169 (Marostegui)
[06:52:03] <marostegui>	 !log test
[06:59:18] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:57] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp1004.wikimedia.org
[07:00:59] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[07:03:16] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1004.wikimedia.org - slyngshede@cumin1002"
[07:04:21] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1004.wikimedia.org - slyngshede@cumin1002"
[07:04:21] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:04:22] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp1004.wikimedia.org on all recursors
[07:04:25] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp1004.wikimedia.org on all recursors
[07:04:51] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp1004.wikimedia.org - slyngshede@cumin1002"
[07:05:50] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp1004.wikimedia.org - slyngshede@cumin1002"
[07:06:20] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp1004.wikimedia.org with OS bookworm
[07:06:32] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980172 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp1004.wikimedia.org with OS bookworm
[07:08:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[07:13:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[07:16:18] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980175 (SLyngshede-WMF) a:SLyngshede-WMF
[07:17:50] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855
[07:17:52] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855
[07:17:54] <stashbot>	 T369855: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855
[07:17:56] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp1004.wikimedia.org with reason: host reimage
[07:18:06] <wikibugs>	 (CR) Arnaudb: [C:+1] "{{done}}" [puppet] - https://gerrit.wikimedia.org/r/1054055 (https://phabricator.wikimedia.org/T369855) (owner: Ladsgroup)
[07:18:48] <jinxer-wm>	 FIRING: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:21:10] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp1004.wikimedia.org with reason: host reimage
[07:22:56] <wikibugs>	 (CR) Marostegui: [C:+2] db1179: Disable notification for db1179 [puppet] - https://gerrit.wikimedia.org/r/1054055 (https://phabricator.wikimedia.org/T369855) (owner: Ladsgroup)
[07:24:00] <wikibugs>	 (CR) Elukey: [C:+2] cfssl: add a condition to cfssl_ocsprefresh.py [puppet] - https://gerrit.wikimedia.org/r/1053913 (https://phabricator.wikimedia.org/T363829) (owner: Elukey)
[07:24:16] <wikibugs>	 (PS2) Jelto: gitlab: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882)
[07:28:01] <wikibugs>	 (CR) Elukey: [V:+1 C:+2] profile::puppetserver::gitprivate: fix post-commit hook [puppet] - https://gerrit.wikimedia.org/r/1053623 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[07:28:09] <wikibugs>	 (CR) Elukey: [C:+2] profile::tcpircbot: allow inbound conn from puppetserver nodes [puppet] - https://gerrit.wikimedia.org/r/1053616 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[07:28:18] <wikibugs>	 (CR) Elukey: [C:+2] profile::kerberos::kadminserver: allow more nodes in rsync [puppet] - https://gerrit.wikimedia.org/r/1053619 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[07:29:27] <wikibugs>	 (PS2) Ladsgroup: mariadb: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/761927 (https://phabricator.wikimedia.org/T297605)
[07:30:57] <wikibugs>	 (Abandoned) Marostegui: mariadb: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/761927 (https://phabricator.wikimedia.org/T297605) (owner: Ladsgroup)
[07:34:19] <wikibugs>	 (PS1) Marostegui: packages_wmf.pp: Remove Buster support [puppet] - https://gerrit.wikimedia.org/r/1054271
[07:35:26] <wikibugs>	 (CR) Marostegui: [C:-2] "Pending checking if there are Busters in WMCS land" [puppet] - https://gerrit.wikimedia.org/r/1054271 (owner: Marostegui)
[07:36:51] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp1004.wikimedia.org with OS bookworm
[07:36:51] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp1004.wikimedia.org
[07:36:59] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980227 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp1004.wikimedia.org with OS bookworm completed: - idp1004 (**PASS*...
[07:37:13] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:44:45] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3223/co" [puppet] - https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[07:46:16] <wikibugs>	 (CR) Jelto: [V:+1] gitlab: replace ferm::service with firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[07:51:45] <wikibugs>	 (CR) Fabfur: varnish: add requestctl filters for cache hits (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) (owner: Giuseppe Lavagetto)
[07:53:37] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp2004.wikimedia.org
[07:53:38] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox
[07:55:57] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2004.wikimedia.org - slyngshede@cumin1002"
[07:57:14] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2004.wikimedia.org - slyngshede@cumin1002"
[07:57:14] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:57:15] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp2004.wikimedia.org on all recursors
[07:57:18] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp2004.wikimedia.org on all recursors
[07:57:52] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2004.wikimedia.org - slyngshede@cumin1002"
[07:58:51] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2004.wikimedia.org - slyngshede@cumin1002"
[08:00:14] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980252 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp2004.wikimedia.org with OS bookworm
[08:01:33] <wikibugs>	 (PS3) Jelto: gitlab: switch gitlab from iptables to nftables [puppet] - https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882)
[08:01:39] <wikibugs>	 (CR) Urbanecm: [C:+1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054025 (https://phabricator.wikimedia.org/T369979) (owner: Seawolf35gerrit)
[08:04:34] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s7 T369882
[08:04:37] <stashbot>	 T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882
[08:05:17] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T369882
[08:09:49] <logmsgbot>	 !log volans@cumin2002 dbctl commit (dc=all): 'Set db2218 with weight 0 T369882', diff saved to https://phabricator.wikimedia.org/P66474 and previous config saved to /var/cache/conftool/dbconfig/20240715-080948-volans.json
[08:09:53] <stashbot>	 T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882
[08:12:54] <logmsgbot>	 !log volans@cumin2002 dbctl commit (dc=all): 'Remove db2218 from API T369882', diff saved to https://phabricator.wikimedia.org/P66475 and previous config saved to /var/cache/conftool/dbconfig/20240715-081252-volans.json
[08:13:22] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp2004.wikimedia.org with reason: host reimage
[08:16:28] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp2004.wikimedia.org with reason: host reimage
[08:18:19] <wikibugs>	 (CR) Giuseppe Lavagetto: varnish: add requestctl filters for cache hits (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) (owner: Giuseppe Lavagetto)
[08:18:29] <wikibugs>	 (CR) Jelto: "unfortunately I was not able to do that. firewall::service expects a array of Array[Stdlib::IP::Address] and not a String. But if we deplo" [puppet] - https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[08:19:10] <wikibugs>	 (PS2) Giuseppe Lavagetto: varnish: add requestctl filters for cache hits [puppet] - https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794)
[08:19:10] <wikibugs>	 (PS2) Giuseppe Lavagetto: varnish: add support for hit rules [puppet] - https://gerrit.wikimedia.org/r/1054082 (https://phabricator.wikimedia.org/T369480)
[08:19:10] <wikibugs>	 (PS2) Giuseppe Lavagetto: varnish: actually include the requestctl hit rules [puppet] - https://gerrit.wikimedia.org/r/1054083 (https://phabricator.wikimedia.org/T369480)
[08:20:19] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3224/co" [puppet] - https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[08:21:59] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 52468
[08:22:31] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 52468
[08:25:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[08:28:56] <wikibugs>	 (PS1) Elukey: profile::tcpircbot: allow puppetservers to contact tcpircbot [puppet] - https://gerrit.wikimedia.org/r/1054273 (https://phabricator.wikimedia.org/T368023)
[08:30:15] <wikibugs>	 (PS3) Jelto: gitlab: introduce log rotation settings [puppet] - https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837)
[08:30:33] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3225/co" [puppet] - https://gerrit.wikimedia.org/r/1054273 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[08:30:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[08:31:02] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "I did a quick search in operations/puppet and cloud/instance-puppet for the class profile::mariadb::packages_wmf" [puppet] - 10https://gerrit.wikimedia.org/r/1054271 (owner: 10Marostegui)
[08:31:29] <wikibugs>	 (03CR) 10Marostegui: packages_wmf.pp: Remove Buster support [puppet] - 10https://gerrit.wikimedia.org/r/1054271 (owner: 10Marostegui)
[08:31:38] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] packages_wmf.pp: Remove Buster support [puppet] - 10https://gerrit.wikimedia.org/r/1054271 (owner: 10Marostegui)
[08:32:29] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:32:35] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3226/co" [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) (owner: 10Jelto)
[08:32:55] <wikibugs>	 (03CR) 10Jelto: [V:03+1] gitlab: introduce log rotation settings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) (owner: 10Jelto)
[08:33:21] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp2004.wikimedia.org with OS bookworm
[08:33:21] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp2004.wikimedia.org
[08:33:30] <wikibugs>	 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980308 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp2004.wikimedia.org with OS bookworm completed: - idp2004 (**PASS*...
[08:34:51] <wikibugs>	 (03PS1) 10Btullis: Add an-redacteddb1001 to the mysql eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/1054275 (https://phabricator.wikimedia.org/T365453)
[08:35:00] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: varnish: add requestctl filters for cache hits [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794)
[08:35:00] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: varnish: add support for hit rules [puppet] - 10https://gerrit.wikimedia.org/r/1054082 (https://phabricator.wikimedia.org/T369480)
[08:35:00] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: varnish: actually include the requestctl hit rules [puppet] - 10https://gerrit.wikimedia.org/r/1054083 (https://phabricator.wikimedia.org/T369480)
[08:35:25] <wikibugs>	 (03PS1) 10Slyngshede: P:idp Add idp2004 to CAS 7 cluster. [puppet] - 10https://gerrit.wikimedia.org/r/1054277 (https://phabricator.wikimedia.org/T367487)
[08:35:56] <wikibugs>	 (03PS2) 10Btullis: Add an-redacteddb1001 to the mysql eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/1054275 (https://phabricator.wikimedia.org/T365453)
[08:36:48] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3227/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054275 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis)
[08:39:16] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] Add an-redacteddb1001 to the mysql eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/1054275 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis)
[08:39:33] <wikibugs>	 (03Abandoned) 10Btullis: Revert the change to disable the gobbin timers on an-launcher [puppet] - 10https://gerrit.wikimedia.org/r/1052945 (https://phabricator.wikimedia.org/T365503) (owner: 10Btullis)
[08:39:54] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Add an-redacteddb1001 to the mysql eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/1054275 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis)
[08:41:07] <wikibugs>	 (03PS2) 10Slyngshede: Styling: Allow the use of normal Codex tables. [software/bitu] - 10https://gerrit.wikimedia.org/r/1052923
[08:41:07] <wikibugs>	 (03PS3) 10Slyngshede: Permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1052924
[08:42:30] <wikibugs>	 (03CR) 10Urbanecm: [C:03+1] "no objection; i'm wondering whether we should have separate hiera keys for lists that are synced as dry-run and that are synced for real. " [puppet] - 10https://gerrit.wikimedia.org/r/1053399 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[08:42:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1052924 (owner: 10Slyngshede)
[08:44:00] <wikibugs>	 (03CR) 10Urbanecm: [C:03+1] mailman3: defined type to sync list members, create timers for each list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053399 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[08:44:22] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "The status in netbox seems to be "unknown", at least from what puppet reports in its motd. Expected?" [puppet] - 10https://gerrit.wikimedia.org/r/1054277 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[08:45:40] <wikibugs>	 (03CR) 10Slyngshede: "I've only JUST created it, so some lag maybe? Anyway, DNS is there, and that's the bit that's required." [puppet] - 10https://gerrit.wikimedia.org/r/1054277 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[08:45:45] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:idp Add idp2004 to CAS 7 cluster. [puppet] - 10https://gerrit.wikimedia.org/r/1054277 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[08:46:14] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#9980350 (10Clement_Goubert) 05Open→03In progress a:03Clement_Goubert
[08:46:35] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1053827 (https://phabricator.wikimedia.org/T369882) (owner: 10Gerrit maintenance bot)
[08:47:09] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1053827 (https://phabricator.wikimedia.org/T369882) (owner: 10Gerrit maintenance bot)
[08:47:14] <wikibugs>	 (03PS1) 10Clément Goubert: data.yaml: Add krb access for xiaoxiao [puppet] - 10https://gerrit.wikimedia.org/r/1054278 (https://phabricator.wikimedia.org/T369517)
[08:48:14] <wikibugs>	 (03CR) 10Volans: [C:03+2] mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1053827 (https://phabricator.wikimedia.org/T369882) (owner: 10Gerrit maintenance bot)
[08:48:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) (owner: 10Lucas Werkmeister (WMDE))
[08:49:34] <wikibugs>	 06SRE, 06collaboration-services: gitlab2002: wrong network for pulic IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#9980370 (10Clement_Goubert)
[08:51:16] <volans>	 !log Starting s7 codfw failover from db2121 to db2218 - T369882
[08:51:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:19] <stashbot>	 T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882
[08:54:52] <wikibugs>	 (03CR) 10Slyngshede: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1054278 (https://phabricator.wikimedia.org/T369517) (owner: 10Clément Goubert)
[08:54:57] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] data.yaml: Add krb access for xiaoxiao [puppet] - 10https://gerrit.wikimedia.org/r/1054278 (https://phabricator.wikimedia.org/T369517) (owner: 10Clément Goubert)
[08:55:17] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] data.yaml: Add krb access for xiaoxiao [puppet] - 10https://gerrit.wikimedia.org/r/1054278 (https://phabricator.wikimedia.org/T369517) (owner: 10Clément Goubert)
[08:55:28] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "yep indeed !" [homer/public] - 10https://gerrit.wikimedia.org/r/1053935 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney)
[08:56:54] <logmsgbot>	 !log volans@cumin1002 dbctl commit (dc=all): 'Promote db2218 to s7 primary T369882', diff saved to https://phabricator.wikimedia.org/P66477 and previous config saved to /var/cache/conftool/dbconfig/20240715-085654-volans.json
[08:56:58] <stashbot>	 T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882
[09:02:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] profile::tcpircbot: allow puppetservers to contact tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/1054273 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[09:03:06] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] profile::tcpircbot: allow puppetservers to contact tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/1054273 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[09:05:07] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#9980431 (10Clement_Goubert) 05In progress→03Resolved p:05Triage→03Medium @XiaoXiao-WMF You should have received an email with instructions on...
[09:05:33] <logmsgbot>	 !log volans@cumin1002 dbctl commit (dc=all): 'Depool db2121 T369882', diff saved to https://phabricator.wikimedia.org/P66478 and previous config saved to /var/cache/conftool/dbconfig/20240715-090532-volans.json
[09:05:37] <stashbot>	 T369882: Switchover s7 master (db2121 -> db2218) - https://phabricator.wikimedia.org/T369882
[09:06:56] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: introduce log rotation settings [puppet] - 10https://gerrit.wikimedia.org/r/1053919 (https://phabricator.wikimedia.org/T369837) (owner: 10Jelto)
[09:08:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: mysql: replication lag monitoring threshold and severity change (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[09:09:57] <wikibugs>	 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9980442 (10Clement_Goubert) 05Open→03Resolved Resolving this as it seems everything is in order. Don't hesitate to reopen should you encounter any issues.
[09:14:40] <wikibugs>	 (03PS1) 10Marostegui: db2121: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054281
[09:14:44] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d3-codfw
[09:14:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Long schema change
[09:15:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Long schema change
[09:15:17] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2121: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054281 (owner: 10Marostegui)
[09:15:56] <marostegui>	 !log Deploy schema change on s7 codfw db2121 dbmaint T367856
[09:15:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:00] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[09:16:54] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-d3-codfw
[09:17:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:17:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:18:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T367856)', diff saved to https://phabricator.wikimedia.org/P66479 and previous config saved to /var/cache/conftool/dbconfig/20240715-091800-marostegui.json
[09:18:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Long schema change
[09:18:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Long schema change
[09:19:03] <marostegui>	 !log Deploy schema change on s7 eqiad db1170 dbmaint T367856
[09:19:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:44] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:23:26] <wikibugs>	 (03PS1) 10Clément Goubert: Revert^2 "changeprop-jobqueue: increase prioritised video concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054283
[09:25:20] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:29:05] <claime>	 !log manually removing mw1349.eqiad.wmnet mw1350.eqiad.wmnet mw1351.eqiad.wmnet from k8s following reimage to videoscalers - T351074
[09:29:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:09] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[09:33:33] <jinxer-wm>	 RESOLVED: [3x] KubernetesCalicoDown: mw1349.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:38:36] <wikibugs>	 (03PS4) 10Ayounsi: Netbox 4: create parent directories [puppet] - 10https://gerrit.wikimedia.org/r/1053636 (https://phabricator.wikimedia.org/T336275)
[09:38:37] <wikibugs>	 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 13Patch-For-Review, 10Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794#9980527 (10Joe) To clarify a bit, I didn't take the route described in the task. In fact, we want:  *...
[09:41:16] <wikibugs>	 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 13Patch-For-Review, 10Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794#9980543 (10Joe) a:03Joe
[09:41:25] <wikibugs>	 (03CR) 10Ayounsi: Netbox 4: create parent directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053636 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[09:41:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[09:41:47] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Netbox 4: create parent directories [puppet] - 10https://gerrit.wikimedia.org/r/1053636 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[09:42:01] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1053877 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[09:46:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[09:50:51] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Netbox 4: create customscript parent directory as well [puppet] - 10https://gerrit.wikimedia.org/r/1048402 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[09:53:47] <wikibugs>	 (03PS1) 10Filippo Giunchedi: o11y: alert on benthos-webrequest-sampled lag [alerts] - 10https://gerrit.wikimedia.org/r/1054288
[09:54:33] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 49544
[09:55:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] o11y: alert on benthos-webrequest-sampled lag [alerts] - 10https://gerrit.wikimedia.org/r/1054288 (owner: 10Filippo Giunchedi)
[09:56:49] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 49544
[09:57:35] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61941
[09:58:14] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61941
[09:58:22] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 262293
[09:58:34] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262293
[09:58:43] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 270361
[09:58:57] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270361
[09:59:19] <wikibugs>	 (03PS1) 10Elukey: profile::pki::multirootca: use info in the client auth vhost [puppet] - 10https://gerrit.wikimedia.org/r/1054289 (https://phabricator.wikimedia.org/T355750)
[09:59:26] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 52999
[09:59:38] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52999
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1000)
[10:00:14] <wikibugs>	 (03PS2) 10Elukey: profile::pki::multirootca: use info in the client auth vhost [puppet] - 10https://gerrit.wikimedia.org/r/1054289 (https://phabricator.wikimedia.org/T355750)
[10:01:16] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3228/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054289 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey)
[10:01:40] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] profile::pki::multirootca: use info in the client auth vhost [puppet] - 10https://gerrit.wikimedia.org/r/1054289 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey)
[10:04:49] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] profile::pki::multirootca: use info in the client auth vhost [puppet] - 10https://gerrit.wikimedia.org/r/1054289 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey)
[10:14:53] <wikibugs>	 (03PS2) 10Filippo Giunchedi: o11y: alert on benthos-webrequest-sampled lag [alerts] - 10https://gerrit.wikimedia.org/r/1054288
[10:20:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[10:20:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[10:21:11] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[10:21:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T367856)', diff saved to https://phabricator.wikimedia.org/P66480 and previous config saved to /var/cache/conftool/dbconfig/20240715-102117-marostegui.json
[10:21:21] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[10:23:36] <wikibugs>	 (03PS1) 10Btullis: Correct the signing key for the yarn apt repo [puppet] - 10https://gerrit.wikimedia.org/r/1054296 (https://phabricator.wikimedia.org/T365839)
[10:24:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:24:19] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Correct the signing key for the yarn apt repo [puppet] - 10https://gerrit.wikimedia.org/r/1054296 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis)
[10:25:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[10:29:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:31:18] <claime>	 arnaudb: bunch of 1205 lock wait timeout exceeded errors on mw-jobrunners during two minutes, looks like only commons
[10:32:04] <claime>	 It's stopped now, but it was a good 1k errors
[10:33:56] <arnaudb>	 thanks for the heads up claime, cc marostegui 
[10:37:49] <arnaudb>	 claime: mwjobrunners are hitting clouddb, no?
[10:42:21] <claime>	 arnaudb: err I don't think they should
[10:42:37] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdj) failed in moss-be2002 - https://phabricator.wikimedia.org/T370034 (10MatthewVernon) 03NEW
[10:42:42] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdj) failed in moss-be2002 - https://phabricator.wikimedia.org/T370034#9980767 (10MatthewVernon) p:05Triage→03Medium
[10:50:04] <wikibugs>	 (03CR) 10Clément Goubert: "> > Do you expect wgMetricsPlatformInstrumentConfiguratorBaseUrl to be different per-wiki?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming)
[10:51:14] <wikibugs>	 (03PS1) 10Marostegui: db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054303
[10:51:49] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1054303 (owner: 10Marostegui)
[10:59:18] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:02:35] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] Revert^2 "changeprop-jobqueue: increase prioritised video concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054283 (owner: 10Clément Goubert)
[11:02:46] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] Revert^2 "changeprop-jobqueue: increase prioritised video concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054283 (owner: 10Clément Goubert)
[11:03:51] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "changeprop-jobqueue: increase prioritised video concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054283 (owner: 10Clément Goubert)
[11:08:13] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[11:08:33] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[11:09:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: git_pull_charts.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:09:30] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[11:11:02] <claime>	 !log Increasing webVideoTranscodePrioritized concurrency in changeprop-jobqueue
[11:11:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:07] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[11:11:47] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[11:12:49] <wikibugs>	 (03CR) 10Urbanecm: "reviewing per a request from Seddon :). logged a few questions inline!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141) (owner: 10Dbrant)
[11:14:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: git_pull_charts.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:03] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872#9980861 (10MatthewVernon) So, looking at [[ https://netbox.wikimedia.org/dcim/devices/?q=ms-be2&sort=rack | netbox ]], hosts are distributed in cod...
[11:24:41] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[11:25:07] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[11:29:18] <jinxer-wm>	 FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:30:21] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:30:36] <jinxer-wm>	 FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:35] <marostegui>	 !log Reboot stashbot
[11:31:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:17] <marostegui>	 !log test
[11:32:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:15] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29689 bytes in 2.431 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:36:49] <wikibugs>	 (03CR) 10Seddon: Enable account vanishing in CentralAuth (labs). (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141) (owner: 10Dbrant)
[11:37:13] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:37:53] <urbanecm>	 jouncebot: nowandnext
[11:37:53] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 22 minute(s)
[11:37:53] <jouncebot>	 In 1 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1300)
[11:38:37] <wikibugs>	 (03PS4) 10Jelto: gitlab: switch gitlab from iptables to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882)
[11:39:06] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9980908 (10Volans)
[11:40:04] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] "Makes sense. Seddon mentioned this is becoming essential, and none of the questions logged is a critical one, so let's ship this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141) (owner: 10Dbrant)
[11:40:27] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9980911 (10Volans) Confirmed it's all good for this specific task, marked as such in the task description.
[11:40:41] <wikibugs>	 (03Merged) 10jenkins-bot: Enable account vanishing in CentralAuth (labs). [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141) (owner: 10Dbrant)
[11:42:02] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3229/co" [puppet] - 10https://gerrit.wikimedia.org/r/1053879 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[11:44:18] <jinxer-wm>	 FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:44] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9980918 (10Clement_Goubert) a:03KFrancis @KFrancis can you please confirm NDA status?
[11:49:18] <jinxer-wm>	 FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:57:56] <wikibugs>	 (03PS1) 10DCausse: team-search-platform: migrate cirrus_cluster_checks [alerts] - 10https://gerrit.wikimedia.org/r/1054317 (https://phabricator.wikimedia.org/T359033)
[11:58:01] <wikibugs>	 (03PS5) 10AOkoth: vrts: fix proxy for download [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078)
[11:59:46] <wikibugs>	 (03PS2) 10DCausse: rdf-streaming-updater: add split graph config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053734
[12:00:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[12:04:39] <godog>	 I'm investigating those otelcollector alerts btw
[12:05:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[12:07:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[12:12:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[12:15:38] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[12:16:01] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[12:20:36] <jinxer-wm>	 FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:22:49] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054328
[12:24:00] <wikibugs>	 (03PS1) 10Stevemunene: Upgrade airflow test instance version to v2.9.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054329 (https://phabricator.wikimedia.org/T365449)
[12:24:18] <jinxer-wm>	 FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:26:22] <wikibugs>	 (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: add split graph config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053734 (owner: 10DCausse)
[12:27:15] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: add split graph config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053734 (owner: 10DCausse)
[12:30:08] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[12:30:10] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[12:30:45] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[12:30:47] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[12:37:10] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2024-07-15-100650-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054340 (https://phabricator.wikimedia.org/T354666)
[12:39:14] <wikibugs>	 (03CR) 10Klausman: [C:03+2] ml: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[12:39:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[12:39:51] <wikibugs>	 (03Merged) 10jenkins-bot: ml: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[12:40:12] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[12:41:00] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[12:41:32] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[12:41:40] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[12:43:54] <wikibugs>	 (03CR) 10Vgutierrez: "looking good, just an inline question about templates" [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) (owner: 10Giuseppe Lavagetto)
[12:44:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[12:51:26] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] Add public suffix list module (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall)
[12:52:55] <wikibugs>	 (03CR) 10Vgutierrez: ncmonitor: Set path for public suffix domain list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall)
[12:54:18] <jinxer-wm>	 FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:54:24] <wikibugs>	 (03PS14) 10Stevemunene: wdqs: add main and scholarly puppet config [puppet] - 10https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364)
[12:54:25] <wikibugs>	 (03PS1) 10Stevemunene: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364)
[12:55:49] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:56:21] <wikibugs>	 (03PS1) 10MVernon: hiera: mark apus service as in production [puppet] - 10https://gerrit.wikimedia.org/r/1054344 (https://phabricator.wikimedia.org/T279621)
[12:58:23] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495)
[12:58:24] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "sorry about that, I was under the wrong impression that you took care of it" [puppet] - 10https://gerrit.wikimedia.org/r/1053937 (https://phabricator.wikimedia.org/T355750) (owner: 10Elukey)
[12:59:08] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495)
[12:59:18] <jinxer-wm>	 FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:59:28] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "(PS3 just adds a trailing comma 🙂)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) (owner: 10Lucas Werkmeister (WMDE))
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1300).
[13:00:05] <jouncebot>	 Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:49] <Lucas_WMDE>	 o/
[13:00:53] <Lucas_WMDE>	 I can deploy ^^
[13:01:35] <wikibugs>	 (03CR) 10Stevemunene: wdqs: add main and scholarly puppet config (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene)
[13:01:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) (owner: 10Lucas Werkmeister (WMDE))
[13:02:18] <wikibugs>	 (03Merged) 10jenkins-bot: Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495) (owner: 10Lucas Werkmeister (WMDE))
[13:02:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1052699|Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] (T369495)]]
[13:02:37] <stashbot>	 T369495: Make `haswbstatement:` work for the EntitySchema property - https://phabricator.wikimedia.org/T369495
[13:05:58] <wikibugs>	 (03PS1) 10MVernon: apus: add active/active geoip service record [dns] - 10https://gerrit.wikimedia.org/r/1054346 (https://phabricator.wikimedia.org/T279621)
[13:08:20] <Lucas_WMDE>	 k8s image build feels like it’s taking unusually long
[13:08:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9981160 (10elukey) As FYI we already have T367970 to upgrade pxelinux to 6.04, but IIRC we already manually tested it and it didn't fix the issue (that...
[13:08:38] <Lucas_WMDE>	 maybe because it’s the first build this week?
[13:09:18] <Lucas_WMDE>	 ok now it’s done (took 6½ minutes all in all)
[13:09:20] <wikibugs>	 (03PS1) 10MVernon: hiera: use discovery hostname in apus probes [puppet] - 10https://gerrit.wikimedia.org/r/1054347 (https://phabricator.wikimedia.org/T279621)
[13:11:48] <Lucas_WMDE>	 docker_pull_k8s also taking much longer than usual
[13:14:18] <jinxer-wm>	 FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:15:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1052699|Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] (T369495)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:15:43] <stashbot>	 T369495: Make `haswbstatement:` work for the EntitySchema property - https://phabricator.wikimedia.org/T369495
[13:15:47] <Lucas_WMDE>	 alright, let’s test
[13:17:26] <Lucas_WMDE>	 hm, not seeing any changes so far…
[13:18:55] <Lucas_WMDE>	 anybody happen to know how I can force re-indexing of a page?
[13:19:00] <Lucas_WMDE>	 I already edited it but it seems to have had no effect
[13:19:18] <jinxer-wm>	 FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:20:00] <Lucas_WMDE>	 oh, I should look at logstash
[13:20:36] <jinxer-wm>	 FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:20:45] <Lucas_WMDE>	 hm, nothing there AFAICT
[13:22:25] <Lucas_WMDE>	 ok https://www.wikidata.org/wiki/Q4115189?action=cirrusDump just updated
[13:22:27] <Lucas_WMDE>	 guess it was delayed
[13:22:38] <Lucas_WMDE>	 P12886 is in outgoing_link now
[13:22:47] <Lucas_WMDE>	 but not in statement_keywords 😔
[13:24:33] <Lucas_WMDE>	 although… if the search updating is delayed / async
[13:24:40] <Lucas_WMDE>	 then I guess it makes sense that I’m not seeing the effect of my config change yet
[13:24:48] <Lucas_WMDE>	 as the job runner(?) wouldn’t be using mwdebug
[13:25:11] <Lucas_WMDE>	 so I guess I’ll just have to roll it out, watch logstash, and be ready to roll back in case it provokes errors on the job runners
[13:25:42] <Lucas_WMDE>	 let’s go ahead with that then
[13:25:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync
[13:27:53] <dcausse>	 Lucas_WMDE: I seem to see it (P12886=E123) using the cirrus doc build code: https://www.wikidata.org/w/api.php?action=query&cbbuilders=content|links&format=json&format=json&formatversion=2&revids=2204971309&prop=cirrusbuilddoc
[13:28:08] <Lucas_WMDE>	 yay, thanks!
[13:28:12] <Lucas_WMDE>	 I already forgot that existed
[13:28:12] <dcausse>	 but there's some caching there that makes it hard to test as well
[13:28:21] <Lucas_WMDE>	 ah right
[13:28:28] <Lucas_WMDE>	 the cache whose key I added mt_rand() to on localhost ^^
[13:29:09] <dcausse>	 :)
[13:31:00] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm)
[13:33:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1052699|Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] (T369495)]] (duration: 30m 51s)
[13:33:28] <stashbot>	 T369495: Make `haswbstatement:` work for the EntitySchema property - https://phabricator.wikimedia.org/T369495
[13:35:45] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm)
[13:36:14] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981304 (10ssingh)
[13:36:32] <wikibugs>	 (03PS1) 10Clément Goubert: turnilo: Fix url shortening [puppet] - 10https://gerrit.wikimedia.org/r/1054348 (https://phabricator.wikimedia.org/T367949)
[13:36:39] <wikibugs>	 (03Merged) 10jenkins-bot: Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm)
[13:37:03] <wikibugs>	 (03PS2) 10Clément Goubert: turnilo: Fix url shortening [puppet] - 10https://gerrit.wikimedia.org/r/1054348 (https://phabricator.wikimedia.org/T367949)
[13:37:03] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054348 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[13:39:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981315 (10ssingh)
[13:39:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbproxy2005.codfw.wmnet with OS bookworm
[13:39:48] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9981320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm
[13:40:27] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Excellent! Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1054348 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[13:40:40] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] turnilo: Fix url shortening [puppet] - 10https://gerrit.wikimedia.org/r/1054348 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[13:41:54] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:41:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:12] <_joe_>	 !log uploading conftool 3.1.0 to bookworm,bullseye,buster
[13:45:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:35] <sukhe>	 _joe_: <3
[13:46:26] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops, 13Patch-For-Review: db1179 stopped answering ping, depooled - https://phabricator.wikimedia.org/T369855#9981330 (10Marostegui) >>! In T369855#9979761, @Ladsgroup wrote: > Also noting that this is a candidate master.  All hosts in x1 are potential candidate masters. They all ru...
[13:49:18] <jinxer-wm>	 FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:36] <sukhe>	 XioNoX: ^ known?
[13:50:04] <XioNoX>	 sukhe: yeah, it's a downtime on the not yet live netbox servers that expired
[13:50:14] <XioNoX>	 I'll re-downtime it
[13:50:27] <sukhe>	 ah ok, 1003 
[13:50:28] <sukhe>	 thanks
[13:50:35] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netboxdb2003.codfw.wmnet with reason: netbox upgrade prep work
[13:50:49] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on netboxdb2003.codfw.wmnet with reason: netbox upgrade prep work
[13:51:04] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work
[13:51:18] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work
[13:53:00] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] apus: add active/active geoip service record [dns] - 10https://gerrit.wikimedia.org/r/1054346 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[13:53:36] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: use discovery hostname in apus probes [puppet] - 10https://gerrit.wikimedia.org/r/1054347 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[13:53:46] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: mark apus service as in production [puppet] - 10https://gerrit.wikimedia.org/r/1054344 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[13:53:54] <logmsgbot>	 !log oblivian@puppetmaster2001 conftool action : set/pooled=yes; selector: name=mw1386.*,cluster=kubernetes,dc=eqiad [reason: Test conftool sal logging]
[13:54:03] <_joe_>	 sukhe: ^^
[13:54:06] <sukhe>	 :D 
[13:54:32] <_joe_>	 there is a problem installing the new version on the other puppetmasters, though
[13:54:32] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2005.codfw.wmnet with reason: host reimage
[13:55:00] <sukhe>	 _joe_: what kind of issue?
[13:56:09] <wikibugs>	 (03CR) 10Elukey: [C:03+1] pyrra: add liftwing SLOs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[13:57:40] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdj) failed in moss-be2002 - https://phabricator.wikimedia.org/T370034#9981365 (10Jhancock.wm) The drive was blinking. thanks for that. The disk has been replaced.
[13:57:44] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981361 (10ayounsi) If I was paranoid, I'd say it's possibly a bug being exploited that can cause a DDoS and we should prioritize T364092.  We have a couple runbooks that could fit the sit...
[13:58:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2005.codfw.wmnet with reason: host reimage
[13:58:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981370 (10ssingh) >>! In T370048#9981361, @ayounsi wrote: > If I was paranoid, I'd say it's possibly a bug being exploited that can cause a DDoS and we should prioritize T364092. >  > We...
[13:59:26] <_joe_>	 sukhe: we have some rules in requestctl that are supposedly cache_miss_only: false
[13:59:30] <_joe_>	 and they'd be moved out
[13:59:39] <_joe_>	 they all seem to be old stuff that shouldn't be there atm
[14:00:07] <wikibugs>	 (03CR) 10Phuedx: [C:03+1] MediaWikiPingback is now on event platform. Use eventlogging_legacy refine job [puppet] - 10https://gerrit.wikimedia.org/r/1050008 (https://phabricator.wikimedia.org/T323828) (owner: 10Ottomata)
[14:00:23] <_joe_>	 or, we can merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054081/ and followups
[14:00:43] <_joe_>	 actually, I think I'll upgrade
[14:00:46] <sukhe>	 noted. hth if you need a second pair of eyes
[14:03:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9981375 (10VRiley-WMF) Hey @fgiunchedi sorry for the late response. I am available to work on this today. Please be aware, we will have to physically move the server in order to plug in a 10Gbit con...
[14:03:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9981376 (10VRiley-WMF) a:03VRiley-WMF
[14:04:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9981382 (10VRiley-WMF) @Eevans Would we be able to move forward with this today or tomorrow? Let us know, thanks!
[14:06:42] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[14:06:56] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[14:06:57] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[14:07:13] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[14:07:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T367781)', diff saved to https://phabricator.wikimedia.org/P66483 and previous config saved to /var/cache/conftool/dbconfig/20240715-140720-arnaudb.json
[14:07:38] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[14:09:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T367781)', diff saved to https://phabricator.wikimedia.org/P66484 and previous config saved to /var/cache/conftool/dbconfig/20240715-140934-arnaudb.json
[14:11:33] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdj) failed in moss-be2002 - https://phabricator.wikimedia.org/T370034#9981395 (10Jhancock.wm) a:03Jhancock.wm
[14:13:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2005.codfw.wmnet with OS bookworm
[14:13:35] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9981400 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm completed: - dbproxy...
[14:15:55] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: readability_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054080 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira)
[14:15:55] <wikibugs>	 (03CR) 10Aqu: [C:03+1] "Looks good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052762 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin)
[14:16:40] <_joe_>	 !log updating conftool to 3.1.0 fleet wide
[14:16:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: varnish: add requestctl filters for cache hits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054081 (https://phabricator.wikimedia.org/T317794) (owner: 10Giuseppe Lavagetto)
[14:19:14] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9981439 (10Marostegui) @Papaul dbproxy2005 looks good now - no ipv6 and I can reach it just fine. If you want to move it back to 10G that's great, and if you'd want t...
[14:24:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P66485 and previous config saved to /var/cache/conftool/dbconfig/20240715-142441-arnaudb.json
[14:25:27] <wikibugs>	 (03CR) 10Herron: [C:03+1] o11y: alert on benthos-webrequest-sampled lag [alerts] - 10https://gerrit.wikimedia.org/r/1054288 (owner: 10Filippo Giunchedi)
[14:36:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] o11y: alert on benthos-webrequest-sampled lag [alerts] - 10https://gerrit.wikimedia.org/r/1054288 (owner: 10Filippo Giunchedi)
[14:39:18] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P66486 and previous config saved to /var/cache/conftool/dbconfig/20240715-143948-arnaudb.json
[14:44:12] <wikibugs>	 07sre-alert-triage, 10SRE Observability (FY2024/2025-Q1): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#9981546 (10fgiunchedi) Thank you @LSobanski ! I'll be reaching out to the individual service owners
[14:45:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9981563 (10Eevans) >>! In T368766#9981382, @VRiley-WMF wrote: > @Eevans Would we be able to move forward with this today or tomorrow? Let us know, thanks!  Sure, that works.  Let me know when!
[14:45:34] <wikibugs>	 (03CR) 10EoghanGaffney: "One more small comment, after that I think it's good to go." [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth)
[14:48:22] <wikibugs>	 (03PS1) 10Gmodena: eventbus: enable instrumentation on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054357 (https://phabricator.wikimedia.org/T363587)
[14:49:59] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033
[14:50:03] <stashbot>	 T362033: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033
[14:50:13] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033
[14:50:18] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9981616 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9483e0b8-53c7-4b67-8ac7-0ee42edaeba5) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with r...
[14:52:34] <wikibugs>	 (03CR) 10AOkoth: vrts: fix proxy for download (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth)
[14:53:51] <wikibugs>	 (03PS6) 10AOkoth: vrts: fix proxy for download [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078)
[14:54:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T367781)', diff saved to https://phabricator.wikimedia.org/P66487 and previous config saved to /var/cache/conftool/dbconfig/20240715-145455-arnaudb.json
[14:54:58] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[14:55:00] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[14:55:11] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[14:55:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T367781)', diff saved to https://phabricator.wikimedia.org/P66488 and previous config saved to /var/cache/conftool/dbconfig/20240715-145517-arnaudb.json
[14:57:04] <wikibugs>	 (03CR) 10AOkoth: vrts: fix proxy for download (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth)
[14:57:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T367781)', diff saved to https://phabricator.wikimedia.org/P66489 and previous config saved to /var/cache/conftool/dbconfig/20240715-145728-arnaudb.json
[14:58:17] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] ml-services: readability_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054080 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira)
[14:59:06] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: readability_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054080 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira)
[14:59:18] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service aqs1013-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:59:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:36] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:06:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:06:26] <sukhe>	 hmm
[15:07:11] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:07:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9981696 (10VRiley-WMF) 05Open→03Resolved I have placed the HDDs back into the original server and have booted it up. Since this ticket is specific to the SSH/Management mismatch, I'll be closing this ticket.
[15:09:57] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:10:36] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:11:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:12:02] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:12:35] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:12:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P66490 and previous config saved to /var/cache/conftool/dbconfig/20240715-151235-arnaudb.json
[15:13:14] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:13:36] <logmsgbot>	 !log mnz@deploy1002 Started deploy [airflow-dags/research@5121748]: (no justification provided)
[15:14:08] <logmsgbot>	 !log mnz@deploy1002 Finished deploy [airflow-dags/research@5121748]: (no justification provided) (duration: 00m 31s)
[15:14:18] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:15:44] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T370062 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[15:15:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T370062 (10ops-monitoring-bot) 03NEW
[15:16:18] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:16:50] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:17:19] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:26:05] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9981821 (10Volans) Sorry if I'm late to the task, I discovered it just today as I was not subscribed to it.  Allow me to be really sad that in this whole discu...
[15:27:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P66491 and previous config saved to /var/cache/conftool/dbconfig/20240715-152742-arnaudb.json
[15:28:47] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlserve@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:31:35] <jouncebot>	 jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1530).
[15:31:58] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netbox2003.codfw.wmnet with reason: netbox upgrade prep work
[15:32:12] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on netbox2003.codfw.wmnet with reason: netbox upgrade prep work
[15:34:49] <wikibugs>	 (03PS1) 10Filippo Giunchedi: o11y: disable pint promql/series for BenthosKafkaConsumerLag + webrequest [alerts] - 10https://gerrit.wikimedia.org/r/1054363 (https://phabricator.wikimedia.org/T369737)
[15:36:42] <wikibugs>	 (03CR) 10Urbanecm: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049915 (https://phabricator.wikimedia.org/T356924) (owner: 10Dreamy Jazz)
[15:37:13] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[15:42:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 (10ssingh) 03NEW
[15:42:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T367781)', diff saved to https://phabricator.wikimedia.org/P66492 and previous config saved to /var/cache/conftool/dbconfig/20240715-154250-arnaudb.json
[15:42:52] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[15:42:55] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[15:43:05] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[15:43:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T367781)', diff saved to https://phabricator.wikimedia.org/P66493 and previous config saved to /var/cache/conftool/dbconfig/20240715-154312-arnaudb.json
[15:45:08] <wikibugs>	 06SRE, 06collaboration-services: gitlab2002: wrong network for pulic IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#9982100 (10LSobanski) a:03Jelto
[15:45:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T367781)', diff saved to https://phabricator.wikimedia.org/P66494 and previous config saved to /var/cache/conftool/dbconfig/20240715-154526-arnaudb.json
[15:46:20] <wikibugs>	 06SRE, 06collaboration-services: gitlab2002: wrong network for pulic IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#9982117 (10LSobanski) p:05Triage→03Medium
[15:46:21] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 102586240 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:47:21] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 64920 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:47:29] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:47:35] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:53:47] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlserve@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:57:57] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9982145 (10ssingh) >>! In T369366#9981821, @Volans wrote: > Sorry if I'm late to the task, I discovered it just today as I was not subscribed to it. >  > Allow...
[15:59:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9982167 (10Jhancock.wm) Got the card back @fgiunchedi. I'm free to swap it anytime on Tuesday or Thursday between 8am and 4pm CDT
[16:00:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P66495 and previous config saved to /var/cache/conftool/dbconfig/20240715-160033-arnaudb.json
[16:02:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9982213 (10fgiunchedi) >>! In T369826#9982167, @Jhancock.wm wrote: > Got the card back @fgiunchedi. I'm free to swap it anytime on Tuesday or Thursday between 8am and 4pm CDT  Thank you ! I'm good w...
[16:06:35] <wikibugs>	 (03PS1) 10Effie Mouzeli: mcrouter: test bookworm image on mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366)
[16:11:49] <wikibugs>	 (03PS1) 10Effie Mouzeli: mcrouter: test bookworm image on mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054368 (https://phabricator.wikimedia.org/T368366)
[16:14:38] <wikibugs>	 (03CR) 10Herron: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[16:15:06] <wikibugs>	 06SRE, 06collaboration-services: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#9982267 (10Aklapper)
[16:15:35] <icinga-wm>	 RECOVERY - dump of s6 in codfw on backupmon1001 is OK: Last dump for s6 at codfw (db2197) taken on 2024-07-15 14:49:19 (74 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[16:15:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P66496 and previous config saved to /var/cache/conftool/dbconfig/20240715-161541-arnaudb.json
[16:16:45] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "Can it be merged and deployed for real now?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza)
[16:18:30] <wikibugs>	 (03CR) 10Elukey: "Thanks! Is it possible that the new image config is misaligned? I don't see it in the CI's diff :(" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli)
[16:26:29] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:28:30] <wikibugs>	 (03PS1) 10AOkoth: vrts: change root mail alias [puppet] - 10https://gerrit.wikimedia.org/r/1054369 (https://phabricator.wikimedia.org/T369445)
[16:28:47] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth)
[16:29:01] <wikibugs>	 (03PS2) 10Effie Mouzeli: mcrouter: test bookworm image on mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054368 (https://phabricator.wikimedia.org/T368366)
[16:29:17] <wikibugs>	 (03PS1) 10Ssingh: Release 0.9.8-1+wmf12u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068)
[16:30:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T367781)', diff saved to https://phabricator.wikimedia.org/P66497 and previous config saved to /var/cache/conftool/dbconfig/20240715-163048-arnaudb.json
[16:30:50] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[16:30:52] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[16:31:03] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[16:31:10] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T367781)', diff saved to https://phabricator.wikimedia.org/P66498 and previous config saved to /var/cache/conftool/dbconfig/20240715-163110-arnaudb.json
[16:31:29] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:33:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T367781)', diff saved to https://phabricator.wikimedia.org/P66499 and previous config saved to /var/cache/conftool/dbconfig/20240715-163320-arnaudb.json
[16:36:16] <wikibugs>	 (03PS1) 10Arlolra: Revert changes in log levels [extensions/Linter] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054371
[16:38:09] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "good idea, let's try it. make sure to send some test mail though" [puppet] - 10https://gerrit.wikimedia.org/r/1054369 (https://phabricator.wikimedia.org/T369445) (owner: 10AOkoth)
[16:38:45] <wikibugs>	 (03CR) 10Effie Mouzeli: "I think it is just the CI, I will get back to you as soon as know for sure" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054367 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli)
[16:43:32] <wikibugs>	 (03CR) 10AOkoth: "Yeah, I can try that after merging this." [puppet] - 10https://gerrit.wikimedia.org/r/1054369 (https://phabricator.wikimedia.org/T369445) (owner: 10AOkoth)
[16:44:34] <wikibugs>	 (03PS1) 10Dzahn: remove git.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1054372 (https://phabricator.wikimedia.org/T323073)
[16:45:30] <wikibugs>	 (03CR) 10Dzahn: "I will start with DNS first since it's trivial to revert just in case. After a little waiting period then coming back to this." [puppet] - 10https://gerrit.wikimedia.org/r/1006982 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn)
[16:45:40] <wikibugs>	 (03CR) 10Dzahn: "I will start with DNS first since it's trivial to revert just in case. After a little waiting period then coming back to this." [puppet] - 10https://gerrit.wikimedia.org/r/1006979 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn)
[16:47:29] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] vrts: change root mail alias [puppet] - 10https://gerrit.wikimedia.org/r/1054369 (https://phabricator.wikimedia.org/T369445) (owner: 10AOkoth)
[16:48:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P66500 and previous config saved to /var/cache/conftool/dbconfig/20240715-164827-arnaudb.json
[16:48:50] <wikibugs>	 (03PS2) 10DCausse: team-search-platform: migrate cirrus_cluster_checks [alerts] - 10https://gerrit.wikimedia.org/r/1054317 (https://phabricator.wikimedia.org/T359033)
[16:48:50] <wikibugs>	 (03PS1) 10DCausse: team-search-platform: migrate cirrus latencies & mem alert [alerts] - 10https://gerrit.wikimedia.org/r/1054374 (https://phabricator.wikimedia.org/T359033)
[16:50:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] team-search-platform: migrate cirrus latencies & mem alert [alerts] - 10https://gerrit.wikimedia.org/r/1054374 (https://phabricator.wikimedia.org/T359033) (owner: 10DCausse)
[16:51:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826#9982422 (10Jhancock.wm) We won't need to move racks. But because of the way the switches are, we can't reuse the same port on the switch. we'll be moving to a different set of 4. Are you going to re...
[16:55:25] <wikibugs>	 (03PS2) 10DCausse: team-search-platform: migrate cirrus latencies & mem alert [alerts] - 10https://gerrit.wikimedia.org/r/1054374 (https://phabricator.wikimedia.org/T359033)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1700)
[17:00:05] <jouncebot>	 ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T1700).
[17:03:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P66501 and previous config saved to /var/cache/conftool/dbconfig/20240715-170334-arnaudb.json
[17:06:29] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:08:31] <wikibugs>	 (03Abandoned) 10Urbanecm: lists::automation: Update stewards-l in real mode [puppet] - 10https://gerrit.wikimedia.org/r/1052188 (https://phabricator.wikimedia.org/T351202) (owner: 10Urbanecm)
[17:08:45] <wikibugs>	 (03PS1) 10Papaul: Add frand200[1-2] to DNS files [dns] - 10https://gerrit.wikimedia.org/r/1054377
[17:12:56] <wikibugs>	 (03PS2) 10Scott French: kserve-inference: update references to deprecated services in fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053809 (https://phabricator.wikimedia.org/T367949)
[17:14:11] <wikibugs>	 (03PS2) 10Scott French: mobileapps: update references to deprecated services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053806 (https://phabricator.wikimedia.org/T367949)
[17:14:11] <wikibugs>	 (03PS2) 10Scott French: push-notifications: update references to deprecated services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053807 (https://phabricator.wikimedia.org/T367949)
[17:14:11] <wikibugs>	 (03PS2) 10Scott French: wikifeeds: update references to deprecated services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053808 (https://phabricator.wikimedia.org/T367949)
[17:17:21] <wikibugs>	 (03CR) 10Dwisehaupt: [C:03+1] "Those hostnames and IPs look good and in the correct ranges. Shipit." [dns] - 10https://gerrit.wikimedia.org/r/1054377 (owner: 10Papaul)
[17:17:50] <wikibugs>	 (03CR) 10Scott French: [C:03+2] kserve-inference: update references to deprecated services in fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053809 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French)
[17:18:06] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Add frand200[1-2] to DNS files [dns] - 10https://gerrit.wikimedia.org/r/1054377 (owner: 10Papaul)
[17:18:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T367781)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240715-171841-arnaudb.json
[17:18:48] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[17:18:52] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[17:19:01] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[17:19:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T367781)', diff saved to https://phabricator.wikimedia.org/P66503 and previous config saved to /var/cache/conftool/dbconfig/20240715-171908-arnaudb.json
[17:19:36] <wikibugs>	 (03CR) 10Scott French: "Alas, forgot to bump the chart version in this one before (done)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053808 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French)
[17:19:38] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9982548 (10Volans) Thanks for the clarification. I didn't meant to imply that you didn't want a cookbook as end goal (although it was not mentioned).  >>! In T...
[17:19:56] <wikibugs>	 (03Merged) 10jenkins-bot: kserve-inference: update references to deprecated services in fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053809 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French)
[17:21:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T367781)', diff saved to https://phabricator.wikimedia.org/P66504 and previous config saved to /var/cache/conftool/dbconfig/20240715-172118-arnaudb.json
[17:23:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#9982576 (10Papaul)
[17:36:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P66505 and previous config saved to /var/cache/conftool/dbconfig/20240715-173625-arnaudb.json
[17:38:09] <wikibugs>	 (03PS2) 10Ssingh: Release 0.9.8-1+wmf12u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068)
[17:40:55] <logmsgbot>	 !log mnz@deploy1002 Started deploy [airflow-dags/research@5121748]: (no justification provided)
[17:41:06] <logmsgbot>	 !log mnz@deploy1002 Finished deploy [airflow-dags/research@5121748]: (no justification provided) (duration: 00m 10s)
[17:41:27] <wikibugs>	 (03CR) 10Ssingh: "I think this is low to medium priority but ready for review. OK build on build2001:" [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh)
[17:42:54] <wikibugs>	 (03CR) 10Ssingh: "The bullseye packages are not updated because the hosts are on bullseye so there is no need for us to follow suit with 0.9.8 there." [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1054370 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh)
[17:51:08] <wikibugs>	 (03PS1) 10Arlolra: Revert "Change Linter log level to info" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054382
[17:51:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P66506 and previous config saved to /var/cache/conftool/dbconfig/20240715-175133-arnaudb.json
[17:55:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/Linter] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054371 (owner: 10Arlolra)
[17:55:49] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] Revert "Change Linter log level to info" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054382 (owner: 10Arlolra)
[17:56:02] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054382 (owner: 10Arlolra)
[17:56:04] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] Revert changes in log levels [extensions/Linter] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1054371 (owner: 10Arlolra)
[17:58:28] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] mailman3: defined type to sync list members, create timers for each list [puppet] - 10https://gerrit.wikimedia.org/r/1053399 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[18:01:01] <wikibugs>	 (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse)
[18:04:15] <herron>	 !log upgraded prometheus-ipmi-exporter to 1.8.0 T368088
[18:04:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:29] <stashbot>	 T368088: upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088
[18:04:47] <wikibugs>	 (03PS3) 10Herron: prom-ipmi-exporter: add sel-events collector [puppet] - 10https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088)
[18:06:04] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9982911 (10ssingh) >>! In T369366#9982548, @Volans wrote: > Thanks for the clarification. I didn't meant to imply that you didn't want a cookbook as end goal (...
[18:06:24] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] "Looks reasonable, PCC reports mostly what i expect. It suspiciously claims a bunch of lines added and non removed in /etc/wdqs/allowlist-w" [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse)
[18:06:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T367781)', diff saved to https://phabricator.wikimedia.org/P66507 and previous config saved to /var/cache/conftool/dbconfig/20240715-180640-arnaudb.json
[18:06:42] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[18:06:53] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[18:06:55] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[18:07:06] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[18:07:11] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs restart envoy: support graph split aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1053778 (https://phabricator.wikimedia.org/T364077)
[18:07:19] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[18:07:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T367781)', diff saved to https://phabricator.wikimedia.org/P66508 and previous config saved to /var/cache/conftool/dbconfig/20240715-180726-arnaudb.json
[18:09:29] <wikibugs>	 (03CR) 10Herron: [C:03+2] prom-ipmi-exporter: add sel-events collector [puppet] - 10https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088) (owner: 10Herron)
[18:09:36] <wikibugs>	 (03PS4) 10BCornwall: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069
[18:09:36] <wikibugs>	 (03PS3) 10BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114)
[18:09:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T367781)', diff saved to https://phabricator.wikimedia.org/P66509 and previous config saved to /var/cache/conftool/dbconfig/20240715-180937-arnaudb.json
[18:10:14] <wikibugs>	 (03CR) 10BCornwall: Add public suffix list module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall)
[18:10:41] <wikibugs>	 (03PS8) 10Jdlrobson: [July 15th] Deploy dark mode to all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795)
[18:10:50] <wikibugs>	 (03CR) 10BCornwall: Add public suffix list module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall)
[18:11:00] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795) (owner: 10Jdlrobson)
[18:11:02] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795) (owner: 10Jdlrobson)
[18:11:41] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3232/console" [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall)
[18:13:11] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests: LDAP access to the analytics-privatedata-users group for Quiddity - https://phabricator.wikimedia.org/T370091 (10Quiddity) 03NEW
[18:13:29] <wikibugs>	 (03CR) 10BCornwall: ncmonitor: Set path for public suffix domain list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall)
[18:15:32] <wikibugs>	 (03PS4) 10BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114)
[18:16:27] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3234/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall)
[18:22:16] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs restart envoy: support graph split aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1053778 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper)
[18:24:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66510 and previous config saved to /var/cache/conftool/dbconfig/20240715-182426-root.json
[18:24:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66511 and previous config saved to /var/cache/conftool/dbconfig/20240715-182436-root.json
[18:24:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P66512 and previous config saved to /var/cache/conftool/dbconfig/20240715-182444-arnaudb.json
[18:25:38] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054385
[18:25:54] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2121: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054386
[18:26:40] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db2121: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054386 (owner: 10Marostegui)
[18:26:48] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1054385 (owner: 10Marostegui)
[18:39:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P66513 and previous config saved to /var/cache/conftool/dbconfig/20240715-183931-root.json
[18:39:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P66514 and previous config saved to /var/cache/conftool/dbconfig/20240715-183942-root.json
[18:39:52] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P66515 and previous config saved to /var/cache/conftool/dbconfig/20240715-183952-arnaudb.json
[18:42:42] <wikibugs>	 (03PS1) 10Dzahn: mailman3: add missing whitespace in sync_list_members [puppet] - 10https://gerrit.wikimedia.org/r/1054388 (https://phabricator.wikimedia.org/T351202)
[18:45:13] <wikibugs>	 (03CR) 10Ssingh: Add public suffix list module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall)
[18:48:35] <wikibugs>	 (03CR) 10Ssingh: Add public suffix list module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall)
[18:54:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66516 and previous config saved to /var/cache/conftool/dbconfig/20240715-185437-root.json
[18:54:45] <wikibugs>	 (CR) Dzahn: [C:+2] mailman3: add missing whitespace in sync_list_members [puppet] - https://gerrit.wikimedia.org/r/1054388 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn)
[18:54:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66517 and previous config saved to /var/cache/conftool/dbconfig/20240715-185447-root.json
[18:55:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T367781)', diff saved to https://phabricator.wikimedia.org/P66518 and previous config saved to /var/cache/conftool/dbconfig/20240715-185459-arnaudb.json
[18:55:01] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[18:55:03] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[18:55:14] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[18:55:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T367781)', diff saved to https://phabricator.wikimedia.org/P66519 and previous config saved to /var/cache/conftool/dbconfig/20240715-185521-arnaudb.json
[18:57:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T367781)', diff saved to https://phabricator.wikimedia.org/P66520 and previous config saved to /var/cache/conftool/dbconfig/20240715-185736-arnaudb.json
[18:59:18] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:09:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66521 and previous config saved to /var/cache/conftool/dbconfig/20240715-190942-root.json
[19:09:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66522 and previous config saved to /var/cache/conftool/dbconfig/20240715-190953-root.json
[19:12:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P66523 and previous config saved to /var/cache/conftool/dbconfig/20240715-191243-arnaudb.json
[19:16:23] <ryankemper>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.144`. Pre-deploy tests passing on canary `wdqs1016`
[19:16:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:36] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@9ad2bec]: 0.3.144
[19:17:04] <ryankemper>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.144` on canary `wdqs1016`; proceeding to rest of fleet
[19:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1098-1099].eqiad.wmnet with reason: T348977
[19:23:54] <stashbot>	 T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977
[19:24:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1098-1099].eqiad.wmnet with reason: T348977
[19:24:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66524 and previous config saved to /var/cache/conftool/dbconfig/20240715-192448-root.json
[19:24:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic109[8-9]* for T348977 - bking@cumin2002
[19:24:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic109[8-9]* for T348977 - bking@cumin2002
[19:24:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66525 and previous config saved to /var/cache/conftool/dbconfig/20240715-192458-root.json
[19:25:07] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@9ad2bec]: 0.3.144 (duration: 08m 31s)
[19:27:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P66526 and previous config saved to /var/cache/conftool/dbconfig/20240715-192750-arnaudb.json
[19:28:22] <wikibugs>	 (PS8) Ryan Kemper: wdqs: enable throttling only for reqs from the CDN [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[19:29:01] <wikibugs>	 (CR) CI reject: [V:-1] wdqs: enable throttling only for reqs from the CDN [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[19:34:46] <wikibugs>	 (CR) Ryan Kemper: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[19:36:39] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[19:37:13] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[19:39:00] <wikibugs>	 (CR) Krinkle: MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: Ahmon Dancy)
[19:39:29] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29691 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[19:39:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66527 and previous config saved to /var/cache/conftool/dbconfig/20240715-193953-root.json
[19:40:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66528 and previous config saved to /var/cache/conftool/dbconfig/20240715-194004-root.json
[19:42:17] <wikibugs>	 (PS9) Ryan Kemper: wdqs: enable throttling only for reqs from the CDN [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[19:42:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T367781)', diff saved to https://phabricator.wikimedia.org/P66529 and previous config saved to /var/cache/conftool/dbconfig/20240715-194257-arnaudb.json
[19:42:59] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[19:43:02] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[19:43:13] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[19:43:24] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[19:43:37] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[19:43:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T367781)', diff saved to https://phabricator.wikimedia.org/P66530 and previous config saved to /var/cache/conftool/dbconfig/20240715-194344-arnaudb.json
[19:46:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T367781)', diff saved to https://phabricator.wikimedia.org/P66531 and previous config saved to /var/cache/conftool/dbconfig/20240715-194559-arnaudb.json
[19:47:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T367856)', diff saved to https://phabricator.wikimedia.org/P66532 and previous config saved to /var/cache/conftool/dbconfig/20240715-194711-marostegui.json
[19:47:16] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[19:51:11] <wikibugs>	 (CR) Ryan Kemper: "WDQS deployed. We'll try merging this and then seeing with tcpdump on wdqs1023 if the appropriate header is set when we do a test federate" [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[19:51:18] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: enable throttling only for reqs from the CDN [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[19:52:07] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[19:55:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66533 and previous config saved to /var/cache/conftool/dbconfig/20240715-195459-root.json
[19:55:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66534 and previous config saved to /var/cache/conftool/dbconfig/20240715-195510-root.json
[19:57:30] <wikibugs>	 (PS1) Ryan Kemper: Revert "wdqs: enable throttling only for reqs from the CDN" [puppet] - https://gerrit.wikimedia.org/r/1054392
[19:59:39] <wikibugs>	 (PS1) Ryan Kemper: wdqs: map lines missing trailing ; [puppet] - https://gerrit.wikimedia.org/r/1054393 (https://phabricator.wikimedia.org/T361950)
[19:59:54] <wikibugs>	 (CR) Ryan Kemper: "This revert may not be necessary if https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054393 works" [puppet] - https://gerrit.wikimedia.org/r/1054392 (owner: Ryan Kemper)
[19:59:58] <wikibugs>	 (CR) CI reject: [V:-1] Revert "wdqs: enable throttling only for reqs from the CDN" [puppet] - https://gerrit.wikimedia.org/r/1054392 (owner: Ryan Kemper)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T2000).
[20:00:04] <jouncebot>	 arlolra and jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:15] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: map lines missing trailing ; [puppet] - https://gerrit.wikimedia.org/r/1054393 (https://phabricator.wikimedia.org/T361950) (owner: Ryan Kemper)
[20:01:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P66535 and previous config saved to /var/cache/conftool/dbconfig/20240715-200106-arnaudb.json
[20:02:16] <Jdlrobson>	 here
[20:02:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P66536 and previous config saved to /var/cache/conftool/dbconfig/20240715-200218-marostegui.json
[20:07:18] <wikibugs>	 (PS1) CDanis: otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855)
[20:11:54] <wikibugs>	 (PS2) CDanis: otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855)
[20:15:13] <Jdlrobson>	 urandom: cjming TheresNoTime RoanKattouw are any of you around to help with a deploy?
[20:16:05] <RoanKattouw>	 Yeah I can deploy
[20:16:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P66537 and previous config saved to /var/cache/conftool/dbconfig/20240715-201613-arnaudb.json
[20:17:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P66538 and previous config saved to /var/cache/conftool/dbconfig/20240715-201726-marostegui.json
[20:17:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795) (owner: Jdlrobson)
[20:18:14] <RoanKattouw>	 arlolra: Are you here for your deployment?
[20:18:30] <arlolra>	 yeah, sorry, I'm just reverting the changes we made last week
[20:18:36] <Jdlrobson>	 thanks RoanKattouw 
[20:18:40] <wikibugs>	 (Merged) jenkins-bot: [July 15th] Deploy dark mode to all logged-in users [mediawiki-config] - https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795) (owner: Jdlrobson)
[20:18:58] <logmsgbot>	 !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1050082|[July 15th] Deploy dark mode to all logged-in users (T368795)]]
[20:19:04] <stashbot>	 T368795: Deploy dark mode to all logged in users on Vector 2022 - https://phabricator.wikimedia.org/T368795
[20:19:15] <ryankemper>	 !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there are no relevant criticals in Icinga, and Grafana looks good
[20:19:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:18] <wikibugs>	 (PS3) CDanis: otelcol: Stop hardcoding k8s master IP addresses [deployment-charts] - https://gerrit.wikimedia.org/r/1054394 (https://phabricator.wikimedia.org/T365855)
[20:19:44] <wikibugs>	 (CR) Catrope: [C:+2] Revert changes in log levels [extensions/Linter] (wmf/1.43.0-wmf.13) - https://gerrit.wikimedia.org/r/1054371 (owner: Arlolra)
[20:22:07] <logmsgbot>	 !log catrope@deploy1002 jdlrobson, catrope: Backport for [[gerrit:1050082|[July 15th] Deploy dark mode to all logged-in users (T368795)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:22:19] <RoanKattouw>	 Jdlrobson: Please test on the test servers
[20:22:34] <wikibugs>	 (Merged) jenkins-bot: Revert changes in log levels [extensions/Linter] (wmf/1.43.0-wmf.13) - https://gerrit.wikimedia.org/r/1054371 (owner: Arlolra)
[20:22:36] <Jdlrobson>	 RoanKattouw: on it
[20:24:24] <Jdlrobson>	 RoanKattouw: lgtm! Please sync!
[20:24:27] <logmsgbot>	 !log catrope@deploy1002 jdlrobson, catrope: Continuing with sync
[20:29:24] <logmsgbot>	 !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1050082|[July 15th] Deploy dark mode to all logged-in users (T368795)]] (duration: 10m 26s)
[20:29:28] <stashbot>	 T368795: Deploy dark mode to all logged in users on Vector 2022 - https://phabricator.wikimedia.org/T368795
[20:30:39] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 18.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:30:55] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054382 (owner: Arlolra)
[20:31:08] <wikibugs>	 (PS2) Arlolra: Revert "Change Linter log level to info" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054382
[20:31:14] <wikibugs>	 (CR) TrainBranchBot: "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054382 (owner: Arlolra)
[20:31:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T367781)', diff saved to https://phabricator.wikimedia.org/P66539 and previous config saved to /var/cache/conftool/dbconfig/20240715-203120-arnaudb.json
[20:31:23] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[20:31:25] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[20:31:36] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[20:31:44] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2125.codfw.wmnet with reason: Maintenance
[20:31:53] <wikibugs>	 (Merged) jenkins-bot: Revert "Change Linter log level to info" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054382 (owner: Arlolra)
[20:31:56] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2125.codfw.wmnet with reason: Maintenance
[20:32:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T367781)', diff saved to https://phabricator.wikimedia.org/P66540 and previous config saved to /var/cache/conftool/dbconfig/20240715-203203-arnaudb.json
[20:32:10] <logmsgbot>	 !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1054371|Revert changes in log levels]], [[gerrit:1054382|Revert "Change Linter log level to info"]]
[20:32:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T367856)', diff saved to https://phabricator.wikimedia.org/P66541 and previous config saved to /var/cache/conftool/dbconfig/20240715-203233-marostegui.json
[20:32:36] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[20:32:38] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[20:32:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[20:32:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[20:33:11] <wikibugs>	 SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests: LDAP access to the analytics-privatedata-users group for Quiddity - https://phabricator.wikimedia.org/T370091#9983336 (MNeisler)
[20:34:32] <logmsgbot>	 !log catrope@deploy1002 arlolra, catrope: Backport for [[gerrit:1054371|Revert changes in log levels]], [[gerrit:1054382|Revert "Change Linter log level to info"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:34:37] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#9983340 (Jhancock.wm) @Papaul   these servers have been cabled, bios updated, and pwd set. pending idrac IPs.   franio2001 eth0 <-> FASW-C8A eth-0/0/18 eth1 <-> FASW-C8B...
[20:34:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T367781)', diff saved to https://phabricator.wikimedia.org/P66542 and previous config saved to /var/cache/conftool/dbconfig/20240715-203435-arnaudb.json
[20:34:46] <RoanKattouw>	 arlolra: Can this be tested meaningfully or should I just continue to sync?
[20:35:02] <arlolra>	 Just continue the sync, thanks
[20:35:06] <logmsgbot>	 !log catrope@deploy1002 arlolra, catrope: Continuing with sync
[20:35:15] <Jdlrobson>	 Thanks RoanKattouw looks like sync was successful!
[20:37:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[20:39:51] <logmsgbot>	 !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1054371|Revert changes in log levels]], [[gerrit:1054382|Revert "Change Linter log level to info"]] (duration: 07m 41s)
[20:40:27] <RoanKattouw>	 And that's it, all done
[20:41:23] <arlolra>	 Thanks for your time RoanKattouw 
[20:44:39] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:44:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[20:46:31] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29705 bytes in 1.310 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:49:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[20:49:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P66543 and previous config saved to /var/cache/conftool/dbconfig/20240715-204944-arnaudb.json
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240715T2100)
[21:02:11] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:04:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P66544 and previous config saved to /var/cache/conftool/dbconfig/20240715-210451-arnaudb.json
[21:12:10] <wikibugs>	 (PS2) Dzahn: remove git.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1054372 (https://phabricator.wikimedia.org/T323073)
[21:13:59] <wikibugs>	 (CR) EoghanGaffney: [C:+1] "Approved. This is the approach the team agreed on" [dns] - https://gerrit.wikimedia.org/r/1054372 (https://phabricator.wikimedia.org/T323073) (owner: Dzahn)
[21:15:44] <wikibugs>	 (CR) Dzahn: [C:+2] remove git.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1054372 (https://phabricator.wikimedia.org/T323073) (owner: Dzahn)
[21:19:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T367781)', diff saved to https://phabricator.wikimedia.org/P66545 and previous config saved to /var/cache/conftool/dbconfig/20240715-211957-arnaudb.json
[21:20:00] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2126.codfw.wmnet with reason: Maintenance
[21:20:04] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[21:20:13] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2126.codfw.wmnet with reason: Maintenance
[21:20:14] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:20:27] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:20:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T367781)', diff saved to https://phabricator.wikimedia.org/P66546 and previous config saved to /var/cache/conftool/dbconfig/20240715-212034-arnaudb.json
[21:20:42] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1052762 (https://phabricator.wikimedia.org/T367134) (owner: TChin)
[21:23:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T367781)', diff saved to https://phabricator.wikimedia.org/P66547 and previous config saved to /var/cache/conftool/dbconfig/20240715-212302-arnaudb.json
[21:34:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9983460 (KFrancis) Hello @JJMC89, please send your full name and postal address to kfrancis@wikimedia.org and I'll get the NDA processed.  Thanks!
[21:38:10] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P66548 and previous config saved to /var/cache/conftool/dbconfig/20240715-213809-arnaudb.json
[21:53:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P66549 and previous config saved to /var/cache/conftool/dbconfig/20240715-215316-arnaudb.json
[22:06:39] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 331.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:08:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T367781)', diff saved to https://phabricator.wikimedia.org/P66550 and previous config saved to /var/cache/conftool/dbconfig/20240715-220823-arnaudb.json
[22:08:26] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2138.codfw.wmnet with reason: Maintenance
[22:08:34] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[22:08:39] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2138.codfw.wmnet with reason: Maintenance
[22:08:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T367781)', diff saved to https://phabricator.wikimedia.org/P66551 and previous config saved to /var/cache/conftool/dbconfig/20240715-220845-arnaudb.json
[22:08:50] <wikibugs>	 (PS5) BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114)
[22:11:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T367781)', diff saved to https://phabricator.wikimedia.org/P66552 and previous config saved to /var/cache/conftool/dbconfig/20240715-221117-arnaudb.json
[22:11:24] <wikibugs>	 (PS5) BCornwall: Add public suffix list module [puppet] - https://gerrit.wikimedia.org/r/1054069
[22:11:24] <wikibugs>	 (PS6) BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114)
[22:12:24] <wikibugs>	 (CR) BCornwall: Add public suffix list module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1054069 (owner: BCornwall)
[22:15:06] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3235/co" [puppet] - https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) (owner: BCornwall)
[22:26:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P66553 and previous config saved to /var/cache/conftool/dbconfig/20240715-222624-arnaudb.json
[22:30:39] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:36:40] <wikibugs>	 ops-codfw, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install vrts2002 - https://phabricator.wikimedia.org/T369672#9983561 (RobH)
[22:41:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P66554 and previous config saved to /var/cache/conftool/dbconfig/20240715-224131-arnaudb.json
[22:50:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T370062#9983576 (Jclark-ctr) Open→Resolved a:Jclark-ctr duplicate for T362033
[22:53:29] <wikibugs>	 (PS1) Dzahn: gerrit: switch firewall provider to nftables at role level [puppet] - https://gerrit.wikimedia.org/r/1054398
[22:56:06] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9983589 (Jclark-ctr) Open→Resolved
[22:56:16] <wikibugs>	 SRE, collaboration-services, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists, Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9983583 (JJMC89) https://gitlab.wi...
[22:56:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T367781)', diff saved to https://phabricator.wikimedia.org/P66555 and previous config saved to /var/cache/conftool/dbconfig/20240715-225639-arnaudb.json
[22:56:41] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2148.codfw.wmnet with reason: Maintenance
[22:56:43] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[22:56:54] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2148.codfw.wmnet with reason: Maintenance
[22:57:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T367781)', diff saved to https://phabricator.wikimedia.org/P66556 and previous config saved to /var/cache/conftool/dbconfig/20240715-225701-arnaudb.json
[22:59:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:59:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T367781)', diff saved to https://phabricator.wikimedia.org/P66557 and previous config saved to /var/cache/conftool/dbconfig/20240715-225933-arnaudb.json
[22:59:35] <wikibugs>	 ops-eqiad, DC-Ops, observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111 (RobH) NEW
[23:00:09] <wikibugs>	 ops-eqiad, DC-Ops, observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#9983621 (RobH)
[23:05:08] <wikibugs>	 ops-codfw, DC-Ops, observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112 (RobH) NEW
[23:05:29] <wikibugs>	 ops-codfw, DC-Ops, observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#9983645 (RobH)
[23:06:06] <wikibugs>	 ops-eqiad, DC-Ops, observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#9983648 (andrea.denisse) a:andrea.denisse
[23:11:30] <logmsgbot>	 !log nshahquinn-wmf@deploy1002 Started deploy [airflow-dags/analytics_product@767d7ad]: (no justification provided)
[23:11:39] <logmsgbot>	 !log nshahquinn-wmf@deploy1002 Finished deploy [airflow-dags/analytics_product@767d7ad]: (no justification provided) (duration: 00m 08s)
[23:12:59] <wikibugs>	 ops-eqiad, SRE, Data-Engineering, DC-Ops: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T369829#9983662 (Jclark-ctr) You have successfully submitted request SR194058934.
[23:14:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P66558 and previous config saved to /var/cache/conftool/dbconfig/20240715-231440-arnaudb.json
[23:16:57] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9983669 (JJMC89) >>! In T369314#9983460, @KFrancis wrote: > Hello @JJMC89, please send your full name and postal address to kfrancis@wikimedia.org and I'll get the NDA processed.  Thanks!...
[23:29:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P66559 and previous config saved to /var/cache/conftool/dbconfig/20240715-232947-arnaudb.json
[23:37:13] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:38:20] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1054400
[23:38:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1054400 (owner: TrainBranchBot)
[23:38:25] <wikibugs>	 (PS1) Zabe: Further configurations for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054401 (https://phabricator.wikimedia.org/T362529)
[23:39:13] <wikibugs>	 (CR) CI reject: [V:-1] Further configurations for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054401 (https://phabricator.wikimedia.org/T362529) (owner: Zabe)
[23:39:47] <wikibugs>	 (PS2) Zabe: Further configurations for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054401 (https://phabricator.wikimedia.org/T362529)
[23:39:58] <zabe>	 jouncebot: nowandnext
[23:39:58] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 20 minute(s)
[23:39:58] <jouncebot>	 In 2 hour(s) and 20 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240716T0200)
[23:41:05] <wikibugs>	 (CR) Zabe: [C:+2] Further configurations for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054401 (https://phabricator.wikimedia.org/T362529) (owner: Zabe)
[23:41:44] <wikibugs>	 (Merged) jenkins-bot: Further configurations for aewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1054401 (https://phabricator.wikimedia.org/T362529) (owner: Zabe)
[23:42:10] <logmsgbot>	 !log zabe@deploy1002 Started scap sync-world: Backport for [[gerrit:1054401|Further configurations for aewikimedia (T362529)]]
[23:42:14] <stashbot>	 T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529
[23:44:39] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:1054401|Further configurations for aewikimedia (T362529)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:44:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T367781)', diff saved to https://phabricator.wikimedia.org/P66560 and previous config saved to /var/cache/conftool/dbconfig/20240715-234454-arnaudb.json
[23:44:57] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2175.codfw.wmnet with reason: Maintenance
[23:44:58] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[23:45:10] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2175.codfw.wmnet with reason: Maintenance
[23:45:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T367781)', diff saved to https://phabricator.wikimedia.org/P66561 and previous config saved to /var/cache/conftool/dbconfig/20240715-234516-arnaudb.json
[23:47:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T367781)', diff saved to https://phabricator.wikimedia.org/P66562 and previous config saved to /var/cache/conftool/dbconfig/20240715-234748-arnaudb.json
[23:48:58] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php aewikimedia translate # T362529
[23:49:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:01] <stashbot>	 T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529
[23:49:11] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[23:49:28] <logmsgbot>	 !log zabe@deploy1002 zabe: Continuing with sync
[23:54:37] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1054401|Further configurations for aewikimedia (T362529)]] (duration: 12m 26s)
[23:54:41] <stashbot>	 T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529
[23:56:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:56:55] <wikibugs>	 (CR) RLazarus: switchdc: prepare mediawiki cache warmup for bare-metal turndown (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1053823 (https://phabricator.wikimedia.org/T367949) (owner: Scott French)