[00:00:10] (CR) Dzahn: [C:+2] installserver: add parsoidtest1001 to partman [puppet] - https://gerrit.wikimedia.org/r/1051856 (https://phabricator.wikimedia.org/T363399) (owner: Dzahn)
[00:01:29] (CR) Dzahn: [V:+2 C:+2] installserver: add parsoidtest1001 to partman [puppet] - https://gerrit.wikimedia.org/r/1051856 (https://phabricator.wikimedia.org/T363399) (owner: Dzahn)
[00:02:06] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1051854 (owner: TrainBranchBot)
[00:05:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[00:07:47] (PS1) Dzahn: set force_puppet7 for parsoidtest1001 [puppet] - https://gerrit.wikimedia.org/r/1051859 (https://phabricator.wikimedia.org/T363399)
[00:08:01] (CR) Dzahn: [V:+2 C:+2] set force_puppet7 for parsoidtest1001 [puppet] - https://gerrit.wikimedia.org/r/1051859 (https://phabricator.wikimedia.org/T363399) (owner: Dzahn)
[00:09:08] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952481 (Papaul) @Jclark-ctr i don't understand why this step was checked Update the operations/puppet repo - this should include updates to preseed.yaml, a...
[00:11:34] (PS1) Dzahn: site: add parsoidtest section [puppet] - https://gerrit.wikimedia.org/r/1051860 (https://phabricator.wikimedia.org/T363399)
[00:12:39] (CR) Dzahn: [C:+2] site: add parsoidtest section [puppet] - https://gerrit.wikimedia.org/r/1051860 (https://phabricator.wikimedia.org/T363399) (owner: Dzahn)
[00:15:10] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host parsoidtest1001.eqiad.wmnet with OS bullseye
[00:15:30] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952491 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye
[00:25:25] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parsoidtest1001.eqiad.wmnet with reason: host reimage
[00:25:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[00:29:01] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parsoidtest1001.eqiad.wmnet with reason: host reimage
[00:30:04] (PS1) Dzahn: phabricator: add script to set safedir git config for deploy repo [puppet] - https://gerrit.wikimedia.org/r/1051861 (https://phabricator.wikimedia.org/T360756)
[00:33:45] (CR) CI reject: [V:-1] phabricator: add script to set safedir git config for deploy repo [puppet] - https://gerrit.wikimedia.org/r/1051861 (https://phabricator.wikimedia.org/T360756) (owner: Dzahn)
[00:42:24] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dzahn@cumin1002"
[00:42:35] (PS2) Dzahn: phabricator: add script to set safedir git config for deploy repo [puppet] - https://gerrit.wikimedia.org/r/1051861 (https://phabricator.wikimedia.org/T360756)
[00:43:25] (PS3) Dzahn: phabricator: add script to set safedir git config for deploy repo [puppet] - https://gerrit.wikimedia.org/r/1051861 (https://phabricator.wikimedia.org/T360756)
[00:43:55] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dzahn@cumin1002"
[00:43:56] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parsoidtest1001.eqiad.wmnet with OS bullseye
[00:44:11] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952509 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye completed:...
[00:45:30] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952510 (Dzahn) machine is now up and running with "insetup::serviceops" role.
[00:45:47] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952511 (Dzahn)
[00:47:01] (CR) CI reject: [V:-1] phabricator: add script to set safedir git config for deploy repo [puppet] - https://gerrit.wikimedia.org/r/1051861 (https://phabricator.wikimedia.org/T360756) (owner: Dzahn)
[00:48:12] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952512 (Dzahn) @akosiaris The machine is now ready to get a production puppet role. If it's replacing `scandium`, then `role(parsoid::testing)` can be appli...
[00:52:40] (PS4) Dzahn: phabricator: add script to set safedir git config for deploy repo [puppet] - https://gerrit.wikimedia.org/r/1051861 (https://phabricator.wikimedia.org/T360756)
[00:53:02] (CR) CI reject: [V:-1] phabricator: add script to set safedir git config for deploy repo [puppet] - https://gerrit.wikimedia.org/r/1051861 (https://phabricator.wikimedia.org/T360756) (owner: Dzahn)
[00:54:40] (PS5) Dzahn: phabricator: add script to set safedir git config for deploy repo [puppet] - https://gerrit.wikimedia.org/r/1051861 (https://phabricator.wikimedia.org/T360756)
[00:57:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T367856)', diff saved to https://phabricator.wikimedia.org/P65777 and previous config saved to /var/cache/conftool/dbconfig/20240704-005750-marostegui.json
[00:57:55] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[01:01:19] (CR) Dzahn: [C:+2] "https://puppet-compiler.wmflabs.org/output/1051861/3163/phab1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1051861 (https://phabricator.wikimedia.org/T360756) (owner: Dzahn)
[01:09:34] (PS1) Dzahn: phabricator: qualify test command for 'unless' in Exec [puppet] - https://gerrit.wikimedia.org/r/1051862 (https://phabricator.wikimedia.org/T360756)
[01:10:50] (CR) Dzahn: [C:+2] "replacing with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051861" [puppet] - https://gerrit.wikimedia.org/r/1051852 (owner: Dzahn)
[01:11:03] (CR) Dzahn: [C:+2] phabricator: qualify test command for 'unless' in Exec [puppet] - https://gerrit.wikimedia.org/r/1051862 (https://phabricator.wikimedia.org/T360756) (owner: Dzahn)
[01:11:59] (CR) Dzahn: [V:+2 C:+2] phabricator: qualify test command for 'unless' in Exec [puppet] - https://gerrit.wikimedia.org/r/1051862 (https://phabricator.wikimedia.org/T360756) (owner: Dzahn)
[01:12:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P65778 and previous config saved to /var/cache/conftool/dbconfig/20240704-011258-marostegui.json
[01:22:51] (CR) Dzahn: [C:+2] "works after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051862" [puppet] - https://gerrit.wikimedia.org/r/1051861 (https://phabricator.wikimedia.org/T360756) (owner: Dzahn)
[01:25:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[01:28:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P65779 and previous config saved to /var/cache/conftool/dbconfig/20240704-012806-marostegui.json
[01:34:13] PROBLEM - Host an-worker1164 is DOWN: PING CRITICAL - Packet loss = 100%
[01:38:43] RECOVERY - Host an-worker1164 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[01:43:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T367856)', diff saved to https://phabricator.wikimedia.org/P65780 and previous config saved to /var/cache/conftool/dbconfig/20240704-014313-marostegui.json
[01:43:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[01:43:18] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[01:43:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[02:09:50] (PS1) Dzahn: phabricator: change "unless" command for generating git safedir config [puppet] - https://gerrit.wikimedia.org/r/1051864 (https://phabricator.wikimedia.org/T360756)
[02:10:44] (CR) Dzahn: [C:+2] phabricator: change "unless" command for generating git safedir config [puppet] - https://gerrit.wikimedia.org/r/1051864 (https://phabricator.wikimedia.org/T360756) (owner: Dzahn)
[02:17:00] (Abandoned) Dzahn: phabricator: configure git safedir for all directories [puppet] - https://gerrit.wikimedia.org/r/1049637 (https://phabricator.wikimedia.org/T360756) (owner: Dzahn)
[02:26:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T364069)', diff saved to https://phabricator.wikimedia.org/P65781 and previous config saved to /var/cache/conftool/dbconfig/20240704-022608-marostegui.json
[02:26:12] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[02:31:09] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_drmrs
[02:33:32] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_drmrs
[02:39:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:26] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:41:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P65782 and previous config saved to /var/cache/conftool/dbconfig/20240704-024115-marostegui.json
[02:43:34] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop analytics cluster
[02:56:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P65783 and previous config saved to /var/cache/conftool/dbconfig/20240704-025622-marostegui.json
[02:59:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T364069)', diff saved to https://phabricator.wikimedia.org/P65784 and previous config saved to /var/cache/conftool/dbconfig/20240704-031129-marostegui.json
[03:11:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[03:11:33] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[03:11:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[03:11:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T364069)', diff saved to https://phabricator.wikimedia.org/P65785 and previous config saved to /var/cache/conftool/dbconfig/20240704-031151-marostegui.json
[03:24:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:49:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[04:28:03] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:30:21] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:31:03] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:31:21] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:34:21] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:35:21] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:41:03] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:42:03] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:44:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T369020
[04:44:24] T369020: Switchover s6 master (db1231 -> db1173) - https://phabricator.wikimedia.org/T369020
[04:44:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1173 with weight 0 T369020', diff saved to https://phabricator.wikimedia.org/P65786 and previous config saved to /var/cache/conftool/dbconfig/20240704-044429-marostegui.json
[04:44:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T369020
[04:45:04] (PS2) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1051326 (https://phabricator.wikimedia.org/T369020)
[04:45:08] (CR) Marostegui: [C:+2] mariadb: Promote db1173 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1051325 (https://phabricator.wikimedia.org/T369020) (owner: Gerrit maintenance bot)
[04:48:03] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:50:03] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:56:03] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:57:03] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:00:40] (CR) Eileen: [C:+1] "Yes - I can confirm this resolved the issue in our testing" [puppet] - https://gerrit.wikimedia.org/r/1051851 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[05:01:58] !log Starting s6 eqiad failover from db1231 to db1173 - T369020
[05:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:01] T369020: Switchover s6 master (db1231 -> db1173) - https://phabricator.wikimedia.org/T369020
[05:02:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T369020', diff saved to https://phabricator.wikimedia.org/P65787 and previous config saved to /var/cache/conftool/dbconfig/20240704-050216-marostegui.json
[05:02:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1173 to s6 primary and set section read-write T369020', diff saved to https://phabricator.wikimedia.org/P65788 and previous config saved to /var/cache/conftool/dbconfig/20240704-050237-marostegui.json
[05:03:08] (CR) Marostegui: [C:+2] wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1051326 (https://phabricator.wikimedia.org/T369020) (owner: Gerrit maintenance bot)
[05:03:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1231 T369020', diff saved to https://phabricator.wikimedia.org/P65789 and previous config saved to /var/cache/conftool/dbconfig/20240704-050334-marostegui.json
[05:08:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Long schema change
[05:08:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Long schema change
[05:11:35] !log Deploy schema change on db1231 s6 eqiad dbmaint T367856
[05:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:37] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:12:21] (PS1) Marostegui: db1231: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1051868
[05:13:37] (CR) Marostegui: [C:+2] db1231: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1051868 (owner: Marostegui)
[05:19:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:56:23] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 230684464 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:57:23] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T0600)
[06:00:05] marostegui, Amir1, and arnaudb: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T0600).
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:11:50] (CR) Kevin Bazira: [C:+1] ml-services: deploy gemma2-27b-it on ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1051806 (https://phabricator.wikimedia.org/T369055) (owner: Ilias Sarantopoulos)
[06:13:31] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051798 (owner: DCausse)
[06:15:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T364069)', diff saved to https://phabricator.wikimedia.org/P65790 and previous config saved to /var/cache/conftool/dbconfig/20240704-061517-marostegui.json
[06:15:20] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[06:30:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P65791 and previous config saved to /var/cache/conftool/dbconfig/20240704-063024-marostegui.json
[06:45:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P65792 and previous config saved to /var/cache/conftool/dbconfig/20240704-064531-marostegui.json
[06:50:02] SRE, LDAP-Access-Requests, Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9952699 (Volans) Resolved→Open @AndyRussG Sorry to bother you again, I had forgot one small detail yesterday. As you now have an external...
[06:54:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:00:05] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T0700).
[07:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:26] o/ I can deploy
[07:00:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T364069)', diff saved to https://phabricator.wikimedia.org/P65793 and previous config saved to /var/cache/conftool/dbconfig/20240704-070038-marostegui.json
[07:00:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[07:00:44] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[07:00:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[07:01:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T364069)', diff saved to https://phabricator.wikimedia.org/P65794 and previous config saved to /var/cache/conftool/dbconfig/20240704-070100-marostegui.json
[07:01:20] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051798 (owner: DCausse)
[07:01:58] (Merged) jenkins-bot: cirrus: re-enable search updates on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/1051798 (owner: DCausse)
[07:02:53] !log dcausse@deploy1002 Started scap sync-world: Backport for [[gerrit:1051798|cirrus: re-enable search updates on wikitech]]
[07:05:33] !log dcausse@deploy1002 dcausse: Backport for [[gerrit:1051798|cirrus: re-enable search updates on wikitech]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:06:14] !log dcausse@deploy1002 dcausse: Continuing with sync
[07:09:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:10:23] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52339 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:21] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1051798|cirrus: re-enable search updates on wikitech]] (duration: 08m 28s)
[07:11:50] (CR) Ilias Sarantopoulos: [C:+2] ml-services: deploy gemma2-27b-it on ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1051806 (https://phabricator.wikimedia.org/T369055) (owner: Ilias Sarantopoulos)
[07:12:46] (Merged) jenkins-bot: ml-services: deploy gemma2-27b-it on ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1051806 (https://phabricator.wikimedia.org/T369055) (owner: Ilias Sarantopoulos)
[07:15:53] !log refreshing the wikitech search indices
[07:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:21] !log closing the backport window
[07:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:35:40] !log root@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1009.eqiad.wmnet
[07:37:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:41:52] !log root@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1009.eqiad.wmnet
[07:41:57] (PS1) Ayounsi: PuppetDB import: don't treat /32-/128 VM interfaces as VIPs [software/netbox-extras] - https://gerrit.wikimedia.org/r/1052045 (https://phabricator.wikimedia.org/T367265)
[07:42:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:42:37] (PS2) Ayounsi: PuppetDB import: don't treat /32-/128 VM interfaces as VIPs [software/netbox-extras] - https://gerrit.wikimedia.org/r/1052045 (https://phabricator.wikimedia.org/T367265)
[07:44:07] (CR) Ayounsi: [V:+1] "Tested on Netbox-next" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1052045 (https://phabricator.wikimedia.org/T367265) (owner: Ayounsi)
[07:44:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:45:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:47:23] (CR) Alexandros Kosiaris: [C:+1] wikifunctions: Raise CPU limit in orchestrator from 200m to 400m [deployment-charts] - https://gerrit.wikimedia.org/r/1051813 (https://phabricator.wikimedia.org/T368892) (owner: Jforrester)
[07:49:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:52:56] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1009.eqiad.wmnet
[07:52:57] !log root@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cloudcephosd1009.eqiad.wmnet
[07:57:54] (CR) Jelto: [C:+2] sre.gitlab.upgrade: lock backups during upgrades [cookbooks] - https://gerrit.wikimedia.org/r/1051107 (https://phabricator.wikimedia.org/T367501) (owner: Jelto)
[07:59:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:00:04] hashar and jeena: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T0800)
[08:01:07] (CR) Elukey: [V:+2 C:+2] knative: upgrade all images to Bookworm and Golang 1.22 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359) (owner: Elukey)
[08:01:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:01:24] (CR) Slyngshede: [C:+1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1052045 (https://phabricator.wikimedia.org/T367265) (owner: Ayounsi)
[08:01:30] (Merged) jenkins-bot: sre.gitlab.upgrade: lock backups during upgrades [cookbooks] - https://gerrit.wikimedia.org/r/1051107 (https://phabricator.wikimedia.org/T367501) (owner: Jelto)
[08:01:45] !log start rebooting A:cp-eqiad (upload|text in parallel) for T366555
[08:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:45] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqiad
[08:02:46] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqiad
[08:05:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:06:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:10:47] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:11:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:11:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:16:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:17:34] I am going to deploy the train :D
[08:17:49] 🚄
[08:18:50] (CR) Ayounsi: [V:+1 C:+2] PuppetDB import: don't treat /32-/128 VM interfaces as VIPs [software/netbox-extras] - https://gerrit.wikimedia.org/r/1052045 (https://phabricator.wikimedia.org/T367265) (owner: Ayounsi)
[08:19:43] (Merged) jenkins-bot: PuppetDB import: don't treat /32-/128 VM interfaces as VIPs [software/netbox-extras] - https://gerrit.wikimedia.org/r/1052045 (https://phabricator.wikimedia.org/T367265) (owner: Ayounsi)
[08:19:45] (PS1) TrainBranchBot: group2 wikis to 1.43.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1052047 (https://phabricator.wikimedia.org/T366957)
[08:19:46] (CR) TrainBranchBot: [C:+2] group2 wikis to 1.43.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1052047 (https://phabricator.wikimedia.org/T366957) (owner: TrainBranchBot)
[08:20:28] (Merged) jenkins-bot: group2 wikis to 1.43.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1052047 (https://phabricator.wikimedia.org/T366957) (owner: TrainBranchBot)
[08:22:20] (CR) Elukey: PuppetDB import: don't treat /32-/128 VM interfaces as VIPs (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/1052045 (https://phabricator.wikimedia.org/T367265) (owner: Ayounsi)
[08:24:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:24:49] !log temporarily disable puppet on A:cp-ulsfo to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051198 (T365718)
[08:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:52] T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718
[08:25:10] hashar: o/
[08:25:41] that Memcached alert is a bit worrying, I just noticed it, do we know what it is?
[08:25:41] (03CR) 10Fabfur: [C:03+2] benthos:cache: encode problematic fields as b64url [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [08:25:53] if not let's pause for a sec before deploying the train [08:27:41] (03PS4) 10Elukey: wmfdebug: Upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051402 (https://phabricator.wikimedia.org/T368366) [08:27:41] (03PS1) 10Elukey: haproxy: Upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1052049 (https://phabricator.wikimedia.org/T369144) [08:28:17] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.12 refs T366957 [08:28:20] T366957: 1.43.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T366957 [08:28:32] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [08:29:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:30:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:31:28] (03PS1) 10Ilias Sarantopoulos: ml-services: update hf image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052051 (https://phabricator.wikimedia.org/T357986) [08:32:49] (03PS2) 10Ilias Sarantopoulos: ml-services: update hf image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052051 (https://phabricator.wikimedia.org/T369055) [08:33:34] memcached error rate ? 
:/ [08:34:29] (03PS1) 10Fabfur: benthos:cache: fix typo in bloblang assignment [puppet] - 10https://gerrit.wikimedia.org/r/1052052 (https://phabricator.wikimedia.org/T365718) [08:34:30] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:34:48] hmm pages from the Draft namespace are emitting `PHP Notice: Undefined offset: 0` [08:34:53] so I am gonna happily rollback [08:34:56] cause it is easy ;) [08:36:03] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052053 (https://phabricator.wikimedia.org/T366957) [08:36:05] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052053 (https://phabricator.wikimedia.org/T366957) (owner: 10TrainBranchBot) [08:36:44] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052053 (https://phabricator.wikimedia.org/T366957) (owner: 10TrainBranchBot) [08:39:19] (03CR) 10Fabfur: [C:03+2] benthos:cache: fix typo in bloblang assignment [puppet] - 10https://gerrit.wikimedia.org/r/1052052 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [08:43:15] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [08:44:00] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [08:44:13] pff [08:44:31] I am using `scap train` to rollback and somehow it ends up stuck on rebuilding and pushing the images [08:44:32] :( [08:44:33] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update hf image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052051 (https://phabricator.wikimedia.org/T369055) (owner: 10Ilias Sarantopoulos) [08:44:57] hashar: to be clear, the memcached errors seem unrelated to
the Draft namespace issue [08:45:16] image build: real 2m38.372s [08:45:17] :/ [08:45:21] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1009.eqiad.wmnet with OS bullseye [08:45:23] elukey: ah cool! [08:45:33] !log enable puppet on A:cp-ulsfo (T365718) [08:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:36] T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718 [08:45:42] I will file a task next about the php notice [08:46:36] (03CR) 10Ilias Sarantopoulos: [C:03+2] "Thanks for the review Kevin!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052051 (https://phabricator.wikimedia.org/T369055) (owner: 10Ilias Sarantopoulos) [08:46:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [08:47:18] I have cancelled the rollback [08:47:31] cause surely rebuilding a FULL IMAGE for changing a single json file is not sustainable [08:47:33] (03Merged) 10jenkins-bot: ml-services: update hf image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052051 (https://phabricator.wikimedia.org/T369055) (owner: 10Ilias Sarantopoulos) [08:47:36] i will do it the old way [08:51:05] roll back is ongoing [08:56:44] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [08:58:55] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9953045 (10Ladsgroup) >>! In T368098#9951531, @xcollazo wrote: > That single r... 
[08:59:16] !log restart mcrouter on mwmaint1002 [08:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:00] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1009.eqiad.wmnet with reason: host reimage [09:00:50] (03CR) 10JMeybohm: [C:03+1] cumin: fix kube-master aliases [puppet] - 10https://gerrit.wikimedia.org/r/1051365 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [09:03:18] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1009.eqiad.wmnet with reason: host reimage [09:04:05] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:05:34] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert "group2 wikis to 1.43.0-wmf.12" - T366957 [09:05:37] T366957: 1.43.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T366957 [09:06:11] blocker filed as T369260 [09:06:12] T369260: PHP Notice: Undefined offset: 0 - https://phabricator.wikimedia.org/T369260 [09:09:14] (03PS1) 10Ayounsi: Upgrade Netbox dev to 4.0.6 [software/netbox-deploy] (dev) - 10https://gerrit.wikimedia.org/r/1052054 (https://phabricator.wikimedia.org/T365989) [09:10:41] (03PS1) 10Jelto: gitlab_runner: add default for dockerfile frontend in devtools [puppet] - 10https://gerrit.wikimedia.org/r/1052057 [09:11:17] (03CR) 10Slyngshede: [C:03+1] "Looks good" [software/netbox-deploy] (dev) - 10https://gerrit.wikimedia.org/r/1052054 (https://phabricator.wikimedia.org/T365989) (owner: 10Ayounsi) [09:13:02] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [09:14:01] (03PS1) 10Ladsgroup: Reduce frequency of two query pages in commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052058 (https://phabricator.wikimedia.org/T369024) [09:14:03] (03CR) 10Jelto: [C:03+2] gitlab_runner: add default 
for dockerfile frontend in devtools [puppet] - 10https://gerrit.wikimedia.org/r/1052057 (owner: 10Jelto) [09:16:41] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old sretest2005 IP - ayounsi@cumin1002" [09:17:40] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old sretest2005 IP - ayounsi@cumin1002" [09:17:40] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:20:11] (03PS3) 10Clément Goubert: Remove legacy appserver and api records [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) [09:20:31] (03PS3) 10Clément Goubert: service.yaml: Switch api and appserver to lvs_setup 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050381 (https://phabricator.wikimedia.org/T367949) [09:20:36] (03CR) 10Elukey: "The image builds fine, we'll probably need to test thumbor in staging to see if the were major changes between 2.4 (current on buster) and" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1052049 (https://phabricator.wikimedia.org/T369144) (owner: 10Elukey) [09:20:56] (03PS3) 10Clément Goubert: Remove legacy appservers from profile::lvs::realserver::pools 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) [09:21:13] (03PS2) 10Clément Goubert: Remove conftool-data and service catalog for legacy appservers 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050383 (https://phabricator.wikimedia.org/T367949) [09:21:13] (03PS1) 10David Caro: cloudcephosd1009: update interface names [puppet] - 10https://gerrit.wikimedia.org/r/1052060 (https://phabricator.wikimedia.org/T309789) [09:21:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [09:22:01] !log 
ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [09:22:03] (03CR) 10David Caro: [C:03+2] cloudcephosd1009: update interface names [puppet] - 10https://gerrit.wikimedia.org/r/1052060 (https://phabricator.wikimedia.org/T309789) (owner: 10David Caro) [09:23:20] !log Manual cleanup of puppet certs for renamed servers mw1417.eqiad.wmnet mw1418.eqiad.wmnet mw2300.codfw.wmnet [09:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:57] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1009.eqiad.wmnet with OS bullseye [09:24:46] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1009.eqiad.wmnet [09:29:28] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1009.eqiad.wmnet [09:29:32] (03CR) 10Effie Mouzeli: [C:03+2] cumin: fix kube-master aliases [puppet] - 10https://gerrit.wikimedia.org/r/1051365 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [09:30:00] (03PS1) 10Elukey: admin_ng: upgrade Knative Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052061 (https://phabricator.wikimedia.org/T368359) [09:33:51] (03PS2) 10Elukey: admin_ng: upgrade Knative Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052061 (https://phabricator.wikimedia.org/T368359) [09:34:05] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:37:55] (03CR) 10Elukey: "Hi folks! I think we can start from ml-staging and let it bake for a few days, and the proceed to prod. How does it sound?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1052061 (https://phabricator.wikimedia.org/T368359) (owner: 10Elukey) [09:40:52] (03CR) 10Cathal Mooney: Update aggregate route creation policy for network pops (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) [09:42:05] (03PS1) 10Filippo Giunchedi: prometheus: remove fr globalcollect endpoint monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1052062 (https://phabricator.wikimedia.org/T368114) [09:42:13] 10ops-eqiad, 06DC-Ops: hw troubleshooting: DIMM in slot B1 of an-presto1004 is no longer detected - https://phabricator.wikimedia.org/T369265 (10BTullis) 03NEW [09:42:34] (03CR) 10Filippo Giunchedi: "I don't know if there's an alternative endpoint we should be probing instead?" [puppet] - 10https://gerrit.wikimedia.org/r/1052062 (https://phabricator.wikimedia.org/T368114) (owner: 10Filippo Giunchedi) [09:42:50] 10ops-eqiad, 06DC-Ops: hw troubleshooting: DIMM in slot B1 of an-presto1004 is no longer detected - https://phabricator.wikimedia.org/T369265#9953198 (10BTullis) [09:42:51] (03CR) 10Cathal Mooney: Update aggregate route creation policy for network pops (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) [09:42:56] (03CR) 10Cathal Mooney: [C:03+2] Update aggregate route creation policy for network pops [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) [09:43:02] (03PS1) 10Superpes15: [arbcom_itwiki] Enable importing from itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052063 (https://phabricator.wikimedia.org/T369264) [09:43:48] (03Merged) 10jenkins-bot: Update aggregate route creation policy for network pops [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) 
[09:50:33] (03PS5) 10Superpes15: Removing 'spamblacklistlog' right from usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049500 (https://phabricator.wikimedia.org/T367683) [09:51:32] (03CR) 10Elukey: "Hi Keith! I am not 100% onboard with this, for the following reasons:" [puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [09:53:32] !log Pushing updated BGP policy to cr2-eqord in Chicago to re-announce codfw IP ranges there T367439 [09:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:35] T367439: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439 [09:55:34] (03PS1) 10Vgutierrez: haproxy,hiera: Test bwlimit per url on cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/1052064 (https://phabricator.wikimedia.org/T317799) [09:57:10] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052064 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [09:58:27] (03CR) 10Filippo Giunchedi: "Alerts themselves LGTM, though I believe the thresholds will need some adjustments, looking at https://w.wiki/AZXc these would result in w" [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb) [09:58:53] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9953260 (10cmooney) Ok change merged, we are now announcing codfw ranges from eqord again: ` cmooney@cr2-eqord> show route advertising-protocol b...
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T1000) [10:03:32] (03PS2) 10Vgutierrez: haproxy,hiera: Test bwlimit per url on cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/1052064 (https://phabricator.wikimedia.org/T317799) [10:04:12] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052064 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [10:06:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T364069)', diff saved to https://phabricator.wikimedia.org/P65797 and previous config saved to /var/cache/conftool/dbconfig/20240704-100622-marostegui.json [10:06:26] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:07:43] (03CR) 10Ladsgroup: "ping 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [10:21:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P65798 and previous config saved to /var/cache/conftool/dbconfig/20240704-102129-marostegui.json [10:29:47] (03PS1) 10D3r1ck01: PermissionManager: Handle empty error array from TitleQuickPermissions [core] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052069 (https://phabricator.wikimedia.org/T369260) [10:33:53] (03PS1) 10Giuseppe Lavagetto: mwscript: add safeguard against running on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1052075 [10:36:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P65799 and previous config saved to /var/cache/conftool/dbconfig/20240704-103636-marostegui.json [10:40:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - 
https://phabricator.wikimedia.org/T360789#9953434 (10cmooney) >>! In T360789#9941103, @Papaul wrote: > All the cabling is done. I am leaving this task open so when we move the console cables from as... [10:45:39] 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9953449 (10cmooney) @papaul sorry I meant to get back to you sooner. I've made decent progress on T369106 and managed to test reimage working ok in one of... [10:45:40] 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9953452 (10cmooney) [10:46:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:47:21] 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9953455 (10cmooney) [10:51:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T364069)', diff saved to https://phabricator.wikimedia.org/P65800 and previous config saved to /var/cache/conftool/dbconfig/20240704-105143-marostegui.json [10:51:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [10:51:47] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:51:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [10:52:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T364069)', diff saved to https://phabricator.wikimedia.org/P65801 and previous config saved to 
/var/cache/conftool/dbconfig/20240704-105205-marostegui.json [10:53:49] (03PS1) 10Elukey: role::builder: add mcrouter uid for docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/1052080 (https://phabricator.wikimedia.org/T368366) [10:54:00] jouncebot: now [10:54:00] For the next 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T1000) [10:54:07] jouncebot: next [10:54:07] In 1 hour(s) and 5 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T1200) [10:55:59] (03CR) 10Elukey: [C:03+2] role::builder: add mcrouter uid for docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/1052080 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:57:37] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [10:59:26] (03CR) 10Fabfur: [C:03+1] "overall looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1052064 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [11:10:37] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [11:11:34] (03PS1) 10Effie Mouzeli: mw-mcrouter: rollout to eqiad api-ext and api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052082 (https://phabricator.wikimedia.org/T346690) [11:12:22] (03PS1) 10Marostegui: db1213: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1052083 (https://phabricator.wikimedia.org/T369250) [11:13:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1213 db1185 T369250', diff 
saved to https://phabricator.wikimedia.org/P65802 and previous config saved to /var/cache/conftool/dbconfig/20240704-111324-root.json [11:13:27] (03CR) 10Marostegui: [C:03+2] db1213: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1052083 (https://phabricator.wikimedia.org/T369250) (owner: 10Marostegui) [11:13:27] T369250: db1213 InnoDB errors - https://phabricator.wikimedia.org/T369250 [11:14:13] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1185.eqiad.wmnet onto db1213.eqiad.wmnet [11:14:37] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [11:17:11] (03PS1) 10Slyngshede: MediaWiki: Allow Bitu to be used as a 2FA proxy. [software/bitu] - 10https://gerrit.wikimedia.org/r/1052085 (https://phabricator.wikimedia.org/T359551) [11:18:17] (03CR) 10CI reject: [V:04-1] MediaWiki: Allow Bitu to be used as a 2FA proxy. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1052085 (https://phabricator.wikimedia.org/T359551) (owner: 10Slyngshede) [11:19:01] 06SRE, 06Infrastructure-Foundations, 10netops: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274 (10cmooney) 03NEW p:05Triage→03Medium [11:19:31] 06SRE, 06Infrastructure-Foundations, 10netops: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274#9953512 (10cmooney) [11:19:32] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9953513 (10cmooney) [11:20:02] 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9953514 (10cmooney) [11:20:04] 06SRE, 06Infrastructure-Foundations, 10netops: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274#9953515 (10cmooney) [11:20:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9953516 (10cmooney) [11:23:58] (03PS1) 10Cathal Mooney: Announce Anycast ranges from Network POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1052086 (https://phabricator.wikimedia.org/T367439) [11:26:55] (03CR) 10Hashar: [C:03+2] PermissionManager: Handle empty error array from TitleQuickPermissions [core] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052069 (https://phabricator.wikimedia.org/T369260) (owner: 10D3r1ck01) [11:27:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9953540 (10cmooney) All is working well on the test-host. 
Well puppet was giving me a headache but I just skipped all that :) Reimage works an... [11:27:32] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9953535 (10cmooney) 05Open→03Resolved All seems good with the policy changes now, closing task. [11:30:25] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:32:43] (03CR) 10Giuseppe Lavagetto: [C:03+1] mw-mcrouter: rollout to eqiad api-ext and api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052082 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:35:07] (03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: rollout to eqiad api-ext and api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052082 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:36:00] (03Merged) 10jenkins-bot: mw-mcrouter: rollout to eqiad api-ext and api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052082 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:36:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:39:26] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:39:37] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [11:40:57] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:43:37] PROBLEM - Hadoop HistoryServer on 
an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [11:45:34] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:46:46] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:47:19] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-mcrouter: rollout to eqiad api-ext and api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052082 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:49:18] (03Merged) 10jenkins-bot: PermissionManager: Handle empty error array from TitleQuickPermissions [core] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052069 (https://phabricator.wikimedia.org/T369260) (owner: 10D3r1ck01) [11:54:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1185.eqiad.wmnet onto db1213.eqiad.wmnet [11:55:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance [11:55:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance [11:55:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T367856)', diff saved to https://phabricator.wikimedia.org/P65803 and previous config saved to /var/cache/conftool/dbconfig/20240704-115522-marostegui.json [11:55:25] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [11:55:38] I am rolling https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1052069 [11:56:11] !log hashar@deploy1002 Started scap sync-world: Backport for [[gerrit:1052069|PermissionManager: Handle empty error array from TitleQuickPermissions (T369260)]] [11:56:14] T369260: PHP Notice: Undefined offset: 0 - 
https://phabricator.wikimedia.org/T369260 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T1200) [12:00:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:00:43] (03PS1) 10Kevin Bazira: ml-services: fix MAX_FEATURE_VALS path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052089 (https://phabricator.wikimedia.org/T368875) [12:02:01] (03PS2) 10Slyngshede: MediaWiki: Allow Bitu to be used as a 2FA proxy. [software/bitu] - 10https://gerrit.wikimedia.org/r/1052085 (https://phabricator.wikimedia.org/T359551) [12:02:34] !log hashar@deploy1002 hashar, d3r1ck01: Backport for [[gerrit:1052069|PermissionManager: Handle empty error array from TitleQuickPermissions (T369260)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:02:34] !log hashar@deploy1002 Sync cancelled. [12:02:37] T369260: PHP Notice: Undefined offset: 0 - https://phabricator.wikimedia.org/T369260 [12:03:02] ? 
[12:03:14] pff stupid default [12:03:44] !log hashar@deploy1002 Started scap sync-world: Backport for [[gerrit:1052069|PermissionManager: Handle empty error array from TitleQuickPermissions (T369260)]] [12:05:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:06:19] !log hashar@deploy1002 hashar, d3r1ck01: Backport for [[gerrit:1052069|PermissionManager: Handle empty error array from TitleQuickPermissions (T369260)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:06:26] !log hashar@deploy1002 hashar, d3r1ck01: Continuing with sync [12:07:17] PROBLEM - Check whether ferm is active by checking the default input chain on mw1377 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:10:37] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [12:11:30] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1052069|PermissionManager: Handle empty error array from TitleQuickPermissions (T369260)]] (duration: 07m 45s) [12:11:32] T369260: PHP Notice: Undefined offset: 0 - https://phabricator.wikimedia.org/T369260 [12:14:37] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [12:15:06] (03CR) 10Slyngshede: "Documentation is pending and will come in a follow up patch." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1052085 (https://phabricator.wikimedia.org/T359551) (owner: 10Slyngshede) [12:16:09] (03CR) 10Slyngshede: "Cc'ing Andrew for awareness, as he might be stuck with the task of replacing the current token validation in other systems." [software/bitu] - 10https://gerrit.wikimedia.org/r/1052085 (https://phabricator.wikimedia.org/T359551) (owner: 10Slyngshede) [12:16:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65804 and previous config saved to /var/cache/conftool/dbconfig/20240704-121621-root.json [12:16:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65805 and previous config saved to /var/cache/conftool/dbconfig/20240704-121631-root.json [12:18:27] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 346.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:21:24] (03PS1) 10Marostegui: Revert "db1213: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1052100 [12:21:52] (03CR) 10Marostegui: [C:03+2] Revert "db1213: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1052100 (owner: 10Marostegui) [12:22:08] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 16), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9953674 (10SGupta-WMF) @Scott_French We have the new image here - https://gitl... 
[12:22:15] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052102 [12:27:47] I am resuming the MediaWiki train [12:27:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1213', diff saved to https://phabricator.wikimedia.org/P65806 and previous config saved to /var/cache/conftool/dbconfig/20240704-122752-root.json [12:28:10] (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052107 (https://phabricator.wikimedia.org/T366957) [12:28:11] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052107 (https://phabricator.wikimedia.org/T366957) (owner: 10TrainBranchBot) [12:28:37] 👀 [12:28:48] (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052107 (https://phabricator.wikimedia.org/T366957) (owner: 10TrainBranchBot) [12:31:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65807 and previous config saved to /var/cache/conftool/dbconfig/20240704-123111-root.json [12:31:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65808 and previous config saved to /var/cache/conftool/dbconfig/20240704-123127-root.json [12:36:31] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.12 refs T366957 [12:36:34] T366957: 1.43.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T366957 [12:37:17] RECOVERY - Check whether ferm is active by checking the default input chain on mw1377 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:40:17] (03CR) 10Klausman: [C:03+1] "Sounds good to me! 
I can also take over for prod if you have more pressing matters to attend to." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052061 (https://phabricator.wikimedia.org/T368359) (owner: 10Elukey) [12:40:27] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:40:37] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [12:43:24] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] kubeadm: core: remove deprecated pre-defined kubelet flags [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881) (owner: 10Arturo Borrero Gonzalez) [12:44:07] (03CR) 10Ayounsi: [C:03+1] Announce Anycast ranges from Network POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1052086 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) [12:46:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65810 and previous config saved to /var/cache/conftool/dbconfig/20240704-124617-root.json [12:46:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65811 and previous config saved to /var/cache/conftool/dbconfig/20240704-124632-root.json [12:48:16] (03PS1) 10Ayounsi: Bird: use the "interface" config option for v6 peers [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) [12:50:35] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [12:51:24] (03CR) 10Ilias Sarantopoulos: [C:03+1] 
"Thanks for updating the images Luca!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052061 (https://phabricator.wikimedia.org/T368359) (owner: 10Elukey) [12:53:56] (03PS2) 10Ayounsi: Bird: use the "interface" config option for v6 peers [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) [12:55:42] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [12:55:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T367856)', diff saved to https://phabricator.wikimedia.org/P65812 and previous config saved to /var/cache/conftool/dbconfig/20240704-125543-marostegui.json [12:55:48] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T1300). Please do the needful. [13:00:05] DreamRimmer and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:15] (03PS1) 10Btullis: Disable auto restart of the karapace service [puppet] - 10https://gerrit.wikimedia.org/r/1052110 (https://phabricator.wikimedia.org/T363461) [13:00:30] (03CR) 10Elukey: [C:03+2] admin_ng: upgrade Knative Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052061 (https://phabricator.wikimedia.org/T368359) (owner: 10Elukey) [13:00:53] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3165/co" [puppet] - 10https://gerrit.wikimedia.org/r/1052110 (https://phabricator.wikimedia.org/T363461) (owner: 10Btullis) [13:01:01] (03CR) 10Alexandros Kosiaris: "I think we can abandon this?" [puppet] - 10https://gerrit.wikimedia.org/r/541209 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [13:01:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65813 and previous config saved to /var/cache/conftool/dbconfig/20240704-130122-root.json [13:01:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65814 and previous config saved to /var/cache/conftool/dbconfig/20240704-130137-root.json [13:02:39] (03CR) 10Ayounsi: [V:03+1] "PCC happy" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [13:04:05] I can’t deploy yet but probably in 5-15 minutes [13:04:34] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1052110 (https://phabricator.wikimedia.org/T363461) (owner: 10Btullis) [13:05:38] (03PS2) 10Alexandros Kosiaris: deployment::rsync: Remove long absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1051772 (https://phabricator.wikimedia.org/T364417) [13:05:38] (03PS1) 10Alexandros Kosiaris: deployment::rsync: Add support for PKI [puppet] - 
10https://gerrit.wikimedia.org/r/1052111 (https://phabricator.wikimedia.org/T364417) [13:05:50] (03CR) 10Btullis: [V:03+1 C:03+2] Disable auto restart of the karapace service [puppet] - 10https://gerrit.wikimedia.org/r/1052110 (https://phabricator.wikimedia.org/T363461) (owner: 10Btullis) [13:06:06] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052111 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [13:07:29] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:08:06] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:08:10] (03PS3) 10Majavah: Drop deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) [13:09:39] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:09:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:10:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P65815 and previous config saved to /var/cache/conftool/dbconfig/20240704-131050-marostegui.json [13:11:39] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:11:51] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:12:25] (03CR) 10Ssingh: "I am curious: it was working for the non-Ganeti case because th" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [13:12:50] (03CR) 10Ssingh: "Redundant comment, summarized above." 
[puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [13:14:25] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: add approvers to analytics-research-admins - https://phabricator.wikimedia.org/T368435#9953790 (10Miriam) @dzahn sure! Are there guidelines for approval that I can/should follow? [13:15:17] (03CR) 10Alexandros Kosiaris: [C:03+2] deployment::rsync: Remove long absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1051772 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [13:15:43] (03CR) 10JMeybohm: [C:03+1] docker::reporter: update exclude rules [puppet] - 10https://gerrit.wikimedia.org/r/1051379 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [13:16:04] alright, I can deploy! [13:16:11] (03CR) 10JMeybohm: [C:03+1] wmfdebug: Upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051402 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [13:16:20] DreamRimmer, Superpes: are you around? 
[13:16:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65816 and previous config saved to /var/cache/conftool/dbconfig/20240704-131628-root.json [13:16:44] (03CR) 10JMeybohm: [C:03+1] haproxy: Upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1052049 (https://phabricator.wikimedia.org/T369144) (owner: 10Elukey) [13:16:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65817 and previous config saved to /var/cache/conftool/dbconfig/20240704-131643-root.json [13:20:12] FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:20:54] !log andrewtavis-wmde@deploy1002 Started deploy [airflow-dags/wmde@d773cac]: (no justification provided) [13:20:57] !log andrewtavis-wmde@deploy1002 Finished deploy [airflow-dags/wmde@d773cac]: (no justification provided) (duration: 00m 03s) [13:24:14] (03PS2) 10Alexandros Kosiaris: deployment::rsync: Add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/1052111 (https://phabricator.wikimedia.org/T364417) [13:25:05] (03CR) 10Vgutierrez: [C:03+1] trafficserver::lua_script: Implement ensure param [puppet] - 10https://gerrit.wikimedia.org/r/1050293 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:25:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P65818 and previous config saved to /var/cache/conftool/dbconfig/20240704-132558-marostegui.json [13:26:07] (03CR) 10Fabfur: 
[C:03+1] trafficserver::lua_script: Implement ensure param [puppet] - 10https://gerrit.wikimedia.org/r/1050293 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:28:43] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052111 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [13:29:49] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) (owner: 10Majavah) [13:31:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65819 and previous config saved to /var/cache/conftool/dbconfig/20240704-133133-root.json [13:31:37] (03CR) 10Alexandros Kosiaris: [C:03+2] "Missed this one, sorry. Thanks, merging" [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) (owner: 10Majavah) [13:31:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65820 and previous config saved to /var/cache/conftool/dbconfig/20240704-133150-root.json [13:32:24] (03PS1) 10Jelto: gitlab_runner: add default for buildkit frontend in devtools [puppet] - 10https://gerrit.wikimedia.org/r/1052122 [13:32:43] !log disabling puppet on P:trafficserver::backend to merge 1050293 - T367949 [13:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:46] T367949: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [13:33:38] (03CR) 10Jelto: [C:03+2] gitlab_runner: add default for buildkit frontend in devtools [puppet] - 10https://gerrit.wikimedia.org/r/1052122 (owner: 10Jelto) [13:34:04] I should be able to verify the dewiki change myself so I guess I can deploy that [13:34:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using 
scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051809 (https://phabricator.wikimedia.org/T368900) (owner: 10Dreamrimmer) [13:34:47] (03PS2) 10Dreamrimmer: Remove "Create a book" link from sidebar on German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051809 (https://phabricator.wikimedia.org/T368900) [13:34:55] (03CR) 10TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051809 (https://phabricator.wikimedia.org/T368900) (owner: 10Dreamrimmer) [13:34:57] (03CR) 10Ayounsi: [V:03+1] "So far we can only see it as a bug from Bird 😞" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [13:35:41] (03Merged) 10jenkins-bot: Remove "Create a book" link from sidebar on German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051809 (https://phabricator.wikimedia.org/T368900) (owner: 10Dreamrimmer) [13:36:00] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1051809|Remove "Create a book" link from sidebar on German Wikipedia (T368900)]] [13:36:00] !log Enabling puppet on cp6016.drmrs.wmnet to test 1050293 - T367949 [13:36:02] T368900: Remove "Create a book" link from sidebar on German Wikipedia - https://phabricator.wikimedia.org/T368900 [13:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:36] (03CR) 10Clément Goubert: [C:03+2] trafficserver::lua_script: Implement ensure param [puppet] - 10https://gerrit.wikimedia.org/r/1050293 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:38:35] !log lucaswerkmeister-wmde@deploy1002 dreamrimmer, lucaswerkmeister-wmde: Backport for [[gerrit:1051809|Remove "Create a book" link from sidebar on German Wikipedia (T368900)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:39:22] requires a Ctrl+F5 for some reason, 
but works \o/ [13:39:25] !log lucaswerkmeister-wmde@deploy1002 dreamrimmer, lucaswerkmeister-wmde: Continuing with sync [13:39:44] hi DreamRimmer! I started deploying the dewiki config change already since I thought I could verify it myself [13:39:53] Thanks [13:40:18] (03PS8) 10AOkoth: vtrs: upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1042179 (https://phabricator.wikimedia.org/T366078) [13:41:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T367856)', diff saved to https://phabricator.wikimedia.org/P65821 and previous config saved to /var/cache/conftool/dbconfig/20240704-134105-marostegui.json [13:41:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:41:08] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:41:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:41:49] !log Enabling and running puppet on P:trafficserver::backend to merge 1050293 - T367949 [13:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:52] T367949: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [13:44:11] (03CR) 10CI reject: [V:04-1] vtrs: upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1042179 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [13:44:35] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1051809|Remove "Create a book" link from sidebar on German Wikipedia (T368900)]] (duration: 08m 35s) [13:44:38] T368900: Remove "Create a book" link from sidebar on German Wikipedia - https://phabricator.wikimedia.org/T368900 [13:44:41] (03CR) 10AOkoth: vtrs: upgrade cookbook (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1042179 
(https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [13:45:27] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 407.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:46:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65822 and previous config saved to /var/cache/conftool/dbconfig/20240704-134639-root.json [13:46:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1185 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65823 and previous config saved to /var/cache/conftool/dbconfig/20240704-134656-root.json [13:48:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T364069)', diff saved to https://phabricator.wikimedia.org/P65824 and previous config saved to /var/cache/conftool/dbconfig/20240704-134806-marostegui.json [13:48:10] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [13:49:36] DreamRimmer: shall we go ahead with deleting the Hiera/Heira namespace? [13:49:56] jouncebot: next [13:49:57] In 1 hour(s) and 10 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T1500) [13:50:12] (03PS1) 10Ssingh: sre.cdn.roll-reboot: log host to SAL post reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/1052124 [13:51:01] Lucas_WMDE: sure [13:51:18] although I just looked at the diffConfig and it’s confusing me [13:51:20] https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig/1187/console [13:51:31] what are those added "499": "true" lines doing there? 
o_O [13:52:27] let me see if I can reproduce the diff locally, so I can get more context lines… [13:53:30] OK [13:53:31] !log disabling puppet on P:trafficserver::backend to merge 1049507 - T367949 [13:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:34] T367949: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [13:55:08] okay, the difference is in wgNamespacesWithSubpages [13:55:23] so… namespace 499 (Nova_Resource_Talk) becomes a namespace with subpages? [13:55:24] o_O [13:56:18] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Cleanup mw-on-k8s scripts [puppet] - 10https://gerrit.wikimedia.org/r/1049507 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:57:46] !log Enabling puppet on cp4037.ulsfo.wmnet to test 1050293 - T367949 [13:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:53] (03CR) 10Lucas Werkmeister (WMDE): "I’m very confused by the [diffConfig](https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig/1187/console" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043797 (https://phabricator.wikimedia.org/T367254) (owner: 10Dreamrimmer) [14:00:23] DreamRimmer: I don’t feel confident deploying that, sorry [14:00:33] hopefully someone™ can figure out the weird config diff… [14:00:38] Code looks good to me, but why is this affecting 498? 
[14:00:51] (03CR) 10Ssingh: "test-cookbook -c 1052124 --dry-run sre.cdn.roll-reboot --reason 'testing dry run' --alias cp-text_ulsfo" [cookbooks] - 10https://gerrit.wikimedia.org/r/1052124 (owner: 10Ssingh) [14:01:09] Lucas_WMDE: No problem, thanks :) [14:01:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65825 and previous config saved to /var/cache/conftool/dbconfig/20240704-140145-root.json [14:01:55] (03CR) 10Dreamrimmer: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043797 (https://phabricator.wikimedia.org/T367254) (owner: 10Dreamrimmer) [14:01:56] !log Enabling and running puppet on P:trafficserver::backend to merge 1050293 - T367949 [14:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:00] T367949: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [14:03:03] (03CR) 10Lucas Werkmeister (WMDE): [Wikitech] Remove namespace 666 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043797 (https://phabricator.wikimedia.org/T367254) (owner: 10Dreamrimmer) [14:03:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P65826 and previous config saved to /var/cache/conftool/dbconfig/20240704-140313-marostegui.json [14:03:15] !log UTC afternoon backport+config window done [14:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:22] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9954050 (10jijiki) [14:05:38] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9954055 (10jijiki) [14:06:49] Lucas_WMDE: ll try to fix it and reschedule it in the next 
backport window. [14:07:04] thanks, good luck! [14:07:46] FWIW, the commands you can see in Jenkins before it prints the diff (`php -e …`, `git add -f …`) worked relatively well for me to reproduce the config diff, if you want to try it locally [14:07:56] but you can also upload more versions of the patch and see what diffConfig says in CI ^^ [14:08:45] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:09:27] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:10:56] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9954061 (10Clement_Goubert) [14:11:16] (03CR) 10Jcrespo: [C:04-2] "Yes, although let me check if some of the ideas here were missing to note them somewhere else." [puppet] - 10https://gerrit.wikimedia.org/r/541209 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [14:12:51] Ok, thanks [14:16:19] 06SRE, 10Wikimedia-Mailing-lists: mailman3 discard_held_messages systemd script apparently failing since 2023-03-26 - https://phabricator.wikimedia.org/T336555#9954094 (10eoghan) 05Open→03Resolved a:03eoghan lists1003 doesn't exist anymore, so this can probably be closed. 
[14:17:19] (03Abandoned) 10Jcrespo: bacula: Remove old storage setup layout and increase concurrency [puppet] - 10https://gerrit.wikimedia.org/r/541209 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [14:18:02] 10SRE-tools, 10conftool, 06DBA, 06Infrastructure-Foundations, 10Spicerack: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893#9954117 (10ABran-WMF) [14:18:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P65827 and previous config saved to /var/cache/conftool/dbconfig/20240704-141820-marostegui.json [14:24:02] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Final mw-on-k8s cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [14:24:13] (03PS5) 10Clément Goubert: trafficserver: Final mw-on-k8s cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949) [14:25:01] (03CR) 10Clément Goubert: trafficserver: Final mw-on-k8s cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [14:25:26] (03CR) 10Elukey: [V:03+2 C:03+2] wmfdebug: Upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051402 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [14:25:37] (03CR) 10Elukey: [V:03+2 C:03+2] haproxy: Upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1052049 (https://phabricator.wikimedia.org/T369144) (owner: 10Elukey) [14:25:51] (03CR) 10Clément Goubert: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [14:29:37] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Final mw-on-k8s cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1050300 (https://phabricator.wikimedia.org/T367949) 
(owner: 10Clément Goubert) [14:29:41] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:29:46] (03PS2) 10Fabfur: hiera: enable benthos on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/1047029 (https://phabricator.wikimedia.org/T367756) [14:30:27] (03CR) 10Fabfur: hiera: enable benthos on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/1047029 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [14:30:47] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:31:11] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047029 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [14:31:17] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9954181 (10Clement_Goubert) [14:33:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T364069)', diff saved to https://phabricator.wikimedia.org/P65829 and previous config saved to /var/cache/conftool/dbconfig/20240704-143327-marostegui.json [14:33:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [14:33:31] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:33:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [14:33:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T364069)', diff saved to https://phabricator.wikimedia.org/P65830 and previous config saved to /var/cache/conftool/dbconfig/20240704-143350-marostegui.json [14:34:27] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! I think is safe to roll out." 
[puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [14:38:33] (03PS9) 10AOkoth: vtrs: upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1042179 (https://phabricator.wikimedia.org/T366078) [14:39:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:35] (03PS1) 10Elukey: services: pin haproxy's version in Thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052125 (https://phabricator.wikimedia.org/T369144) [14:40:43] 10SRE-swift-storage, 07Wikimedia-production-error: Unable to undelete File:Boston_Bruins.svg - https://phabricator.wikimedia.org/T369299#9954205 (10Reedy) [14:40:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366366) (owner: 10Jdlrobson) [14:41:01] (03PS1) 10Giuseppe Lavagetto: mediawiki::httpd add APACHE_RUN_PORT to env [puppet] - 10https://gerrit.wikimedia.org/r/1052127 [14:41:01] (03PS1) 10Giuseppe Lavagetto: mediawiki::sites: switch to use APACHE_RUN_PORT [puppet] - 10https://gerrit.wikimedia.org/r/1052128 [14:41:02] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: remove hack for static files [puppet] - 10https://gerrit.wikimedia.org/r/1052129 [14:41:28] (03CR) 10Marostegui: [C:03+1] "thank you" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052058 (https://phabricator.wikimedia.org/T369024) (owner: 10Ladsgroup) [14:44:19] (03CR) 10Volans: sre.cdn.roll-reboot: log host to SAL post reboot (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1052124 (owner: 10Ssingh) [14:44:33] 
10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on db2161 - https://phabricator.wikimedia.org/T369229#9954221 (10Marostegui) p:05Triage→03High @Papaul @Jhancock.wm this is a master can we get a new disk? This host was bought 2022-06-19 [14:44:44] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2161 - https://phabricator.wikimedia.org/T369229#9954224 (10Marostegui) [14:45:29] (03CR) 10CI reject: [V:04-1] mediawiki::sites: switch to use APACHE_RUN_PORT [puppet] - 10https://gerrit.wikimedia.org/r/1052128 (owner: 10Giuseppe Lavagetto) [14:45:30] (03CR) 10CI reject: [V:04-1] mediawiki::web::yaml_defs: remove hack for static files [puppet] - 10https://gerrit.wikimedia.org/r/1052129 (owner: 10Giuseppe Lavagetto) [14:46:39] (03CR) 10CI reject: [V:04-1] mediawiki::httpd add APACHE_RUN_PORT to env [puppet] - 10https://gerrit.wikimedia.org/r/1052127 (owner: 10Giuseppe Lavagetto) [14:47:52] (03CR) 10Ssingh: sre.cdn.roll-reboot: log host to SAL post reboot (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1052124 (owner: 10Ssingh) [14:47:58] (03PS2) 10Giuseppe Lavagetto: mediawiki::httpd add APACHE_RUN_PORT to env [puppet] - 10https://gerrit.wikimedia.org/r/1052127 [14:47:59] (03PS2) 10Giuseppe Lavagetto: mediawiki::sites: switch to use APACHE_RUN_PORT [puppet] - 10https://gerrit.wikimedia.org/r/1052128 [14:47:59] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: remove hack for static files [puppet] - 10https://gerrit.wikimedia.org/r/1052129 [14:51:51] (03PS2) 10Ssingh: sre.cdn.roll-reboot: log host to SAL post reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/1052124 [14:53:29] (03CR) 10Clément Goubert: [C:03+1] services: pin haproxy's version in Thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052125 (https://phabricator.wikimedia.org/T369144) (owner: 10Elukey) [14:53:54] (03CR) 10Elukey: [C:03+2] services: pin haproxy's version in Thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052125 
(https://phabricator.wikimedia.org/T369144) (owner: 10Elukey) [14:56:01] (03CR) 10Ssingh: "Even that doesn't look nice, let me revise it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1052124 (owner: 10Ssingh) [14:56:44] (03PS3) 10Ladsgroup: mariadb: Remove direct grants on mailman databases [puppet] - 10https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833) [14:56:49] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Remove direct grants on mailman databases [puppet] - 10https://gerrit.wikimedia.org/r/1047474 (https://phabricator.wikimedia.org/T367833) (owner: 10Ladsgroup) [14:59:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:26] (03Abandoned) 10Arnaudb: mariadb: installs backport mysqld-exporter on deb11 [puppet] - 10https://gerrit.wikimedia.org/r/1051388 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [15:00:05] hashar and jeena: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T1500) [15:00:06] (03PS3) 10Ssingh: sre.cdn.roll-reboot: log host to SAL post reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/1052124 [15:01:03] (03CR) 10Ssingh: "DRY-RUN: cookbooks.sre.cdn.roll-reboot finished rebooting cp4052.ulsfo.wmnet" [cookbooks] - 10https://gerrit.wikimedia.org/r/1052124 (owner: 10Ssingh) [15:02:35] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1052124 (owner: 10Ssingh) [15:02:44] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [15:02:49] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:05:40] (03CR) 10Ssingh: [C:03+2] sre.cdn.roll-reboot: log host to SAL post reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/1052124 (owner: 
10Ssingh) [15:07:25] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9954302 (10ayounsi) > Possible, but complicated and fuzzy as to the effects on different scenarios. If we "weight" a single anycast/24 by loading up several distinct NS IPs from it, it also has an outsized negative impact a... [15:08:09] (03CR) 10Elukey: [V:03+1 C:04-1] "After some code reading, I am not 100% sure that we cannot use puppet-merge on puppetserver nodes. It is just a matter of not knowing if i" [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [15:17:07] (03CR) 10Cathal Mooney: "So thinking this through I think the risk is fairly low." [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [15:17:10] (03CR) 10Cathal Mooney: [C:03+2] Add class-of-service scheduler and classifiers plus var to control [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [15:17:41] (03Merged) 10jenkins-bot: Add class-of-service scheduler and classifiers plus var to control [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [15:19:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [15:19:20] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9954321 (10Marostegui) >>! In T368098#9951531, @xcollazo wrote: > It seems, ho... 
[15:22:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [15:22:50] (03PS1) 10Cathal Mooney: Revert "Add class-of-service scheduler and classifiers plus var to control" [homer/public] - 10https://gerrit.wikimedia.org/r/1052131 [15:23:46] (03CR) 10Cathal Mooney: [C:03+2] Revert "Add class-of-service scheduler and classifiers plus var to control" [homer/public] - 10https://gerrit.wikimedia.org/r/1052131 (owner: 10Cathal Mooney) [15:24:13] (03Merged) 10jenkins-bot: Revert "Add class-of-service scheduler and classifiers plus var to control" [homer/public] - 10https://gerrit.wikimedia.org/r/1052131 (owner: 10Cathal Mooney) [15:24:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [15:25:41] (03PS4) 10Arnaudb: mariadb: add monitoring on io pressure for mariadb hosts [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) [15:27:06] (03CR) 10Arnaudb: "https://w.wiki/_sUM9 those are sensitive thresholds! 
the latest version should be OK" [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb) [15:27:11] (03CR) 10Cathal Mooney: [C:03+2] "Done" [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [15:27:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [15:28:54] (03PS1) 10Ssingh: sre.dns.roll-reboot: add pre_ and post_action SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1052132 [15:31:02] 06SRE, 10DNS, 10fundraising-tech-ops, 06Traffic, 13Patch-For-Review: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9954341 (10AKanji-WMF) Thanks all - confirmed with @RLewis that this should point to https://wikimediafoundation.org/support/benefactors/ [15:31:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [15:31:39] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9954359 (10elukey) Getting back to this after having read more puppet code/configs, I think I have a clearer idea now :) We can easily sp... 
[15:34:06] (03CR) 10Ssingh: "DRY-RUN: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org" [cookbooks] - 10https://gerrit.wikimedia.org/r/1052132 (owner: 10Ssingh) [15:35:59] (03CR) 10AikoChou: [C:03+1] ml-services: fix MAX_FEATURE_VALS path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052089 (https://phabricator.wikimedia.org/T368875) (owner: 10Kevin Bazira) [15:36:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [15:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:39:42] (03CR) 10Elukey: [C:03+1] "I don't see alternatives as well, the kubelet runs as root and registering the unix socket under /var/lib/kubelet/plugins requires elevate" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051732 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [15:41:02] (03CR) 10Elukey: [C:03+2] docker::reporter: update exclude rules [puppet] - 10https://gerrit.wikimedia.org/r/1051379 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [15:42:41] (03CR) 10Volans: [C:03+1] "LGTM. 
Up to you if it's too spammy :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1052132 (owner: 10Ssingh) [15:43:19] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: fix MAX_FEATURE_VALS path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052089 (https://phabricator.wikimedia.org/T368875) (owner: 10Kevin Bazira) [15:44:09] (03CR) 10Ssingh: "Ha, that's very fair but since it's the DNS hosts -- and we should really know what state they are in -- and the reboots happen very rarel" [cookbooks] - 10https://gerrit.wikimedia.org/r/1052132 (owner: 10Ssingh) [15:44:18] (03CR) 10Ssingh: [C:03+2] sre.dns.roll-reboot: add pre_ and post_action SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1052132 (owner: 10Ssingh) [15:49:17] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:49:30] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:49:43] (the train log triage did not happen since today is a holiday in the USA) [15:49:57] but the logs look quiet [15:57:37] (03PS1) 10DCausse: cirrus: add cirrussearch-legacy-updater dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052135 [15:57:48] (03PS1) 10DCausse: cirrus: run the sanitizer only on cirrussearch-legacy-updater dblist [puppet] - 10https://gerrit.wikimedia.org/r/1052136 [15:58:22] (03CR) 10CI reject: [V:04-1] cirrus: add cirrussearch-legacy-updater dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052135 (owner: 10DCausse) [16:00:05] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:55] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052089 (https://phabricator.wikimedia.org/T368875) (owner: 10Kevin Bazira) [16:03:06] (03Merged) 10jenkins-bot: ml-services: fix MAX_FEATURE_VALS path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052089 (https://phabricator.wikimedia.org/T368875) (owner: 10Kevin Bazira) [16:05:02] (03CR) 10Jbond: [C:03+1] puppetmaster: change git sender email address to git@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1051846 (owner: 10Dzahn) [16:06:21] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:07:32] (03CR) 10Jbond: "Im not against this but keep in mind there are many modules and profiles in production that are never used in cloud. So this could cause " [puppet] - 10https://gerrit.wikimedia.org/r/1051332 (owner: 10David Caro) [16:12:54] (03CR) 10Btullis: [C:03+2] cephcsi: Grant elevated privileges to the driver-registrar container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051732 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:13:03] (03PS2) 10DCausse: cirrus: add cirrussearch-legacy-updater dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052135 [16:14:21] !log btullis@cumin1002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [16:14:45] (03CR) 10Jbond: [C:03+1] "lgtm some nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/1049761 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [16:14:54] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [16:15:05] (03CR) 10Jbond: [C:03+1] pontoon: Remove more puppet 5 leftovers [puppet] - 10https://gerrit.wikimedia.org/r/1047502 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [16:15:18] !log btullis@cumin1002 START - Cookbook 
sre.hosts.reboot-single for host an-worker1078.eqiad.wmnet [16:15:32] (03CR) 10Jbond: [C:03+1] mariadb::ferm_wmcs: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1037766 (owner: 10Muehlenhoff) [16:15:59] (03Merged) 10jenkins-bot: cephcsi: Grant elevated privileges to the driver-registrar container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051732 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:19:48] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:20:01] (03PS1) 10Elukey: knative-serving: remove _example settings shipped with upstream yamls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052141 (https://phabricator.wikimedia.org/T368359) [16:20:02] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:20:04] (03CR) 10Kamila Součková: [C:03+2] opentelemetry: update k8s API IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048498 (owner: 10Kamila Součková) [16:20:11] (03CR) 10CI reject: [V:04-1] opentelemetry: update k8s API IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048498 (owner: 10Kamila Součková) [16:20:47] (03CR) 10Ssingh: "I am guessing you are looking to merge this next week? 
I wanted to put an extra pair of eyes and see if we can spot something on why this " [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [16:21:22] (03CR) 10Ssingh: "It's the usual bird related OCD 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [16:23:48] (03Abandoned) 10Kamila Součková: opentelemetry: update k8s API IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048498 (owner: 10Kamila Součková) [16:26:52] (03CR) 10Kamila Součková: [C:03+1] shellbox-video: increase replicas, namespace resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050375 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [16:35:33] PROBLEM - Host an-worker1078 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:04] bd808: That opportune time for a Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T1700) [17:07:09] RECOVERY - Host an-worker1078 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [17:10:22] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1078.eqiad.wmnet [17:20:27] FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:35:53] (03CR) 10Tchanders: [C:03+1] Remove modifications of wgCheckUserLogAdditionalRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050424 (https://phabricator.wikimedia.org/T346022) (owner: 10Dreamy Jazz) [17:37:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T364069)', diff saved to https://phabricator.wikimedia.org/P65831 and previous config saved to /var/cache/conftool/dbconfig/20240704-173735-marostegui.json [17:37:39] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [17:39:22] (03PS1) 10Ssingh: geo-maps: send BR (Brazil) to magru [dns] - 10https://gerrit.wikimedia.org/r/1052144 (https://phabricator.wikimedia.org/T359054) [17:40:52] (03CR) 10Ssingh: [C:04-2] "Do not merge before July 11 10:00 UTC" [dns] - 10https://gerrit.wikimedia.org/r/1052144 (https://phabricator.wikimedia.org/T359054) (owner: 10Ssingh) [17:42:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [17:42:18] FIRING: NELByCountryNotReported: NEL metrics by country not 
reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [17:47:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [17:47:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [17:50:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [17:52:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P65832 and previous config saved to /var/cache/conftool/dbconfig/20240704-175242-marostegui.json [17:55:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [18:00:05] hashar and jeena: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T1800) [18:07:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P65833 and previous config saved to /var/cache/conftool/dbconfig/20240704-180749-marostegui.json [18:22:58] !log marostegui@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db1219 (T364069)', diff saved to https://phabricator.wikimedia.org/P65834 and previous config saved to /var/cache/conftool/dbconfig/20240704-182257-marostegui.json [18:22:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1232.eqiad.wmnet with reason: Maintenance [18:23:01] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [18:23:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1232.eqiad.wmnet with reason: Maintenance [18:23:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T364069)', diff saved to https://phabricator.wikimedia.org/P65835 and previous config saved to /var/cache/conftool/dbconfig/20240704-182308-marostegui.json [18:45:53] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:51:26] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:52:32] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314 (10JJMC89) 03NEW [18:56:27] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 30.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:58:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - 
https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [18:59:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [18:59:28] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9954997 (10Urbanecm) Approved. [18:59:35] (03PS1) 10NMW03: Add editcontentmodel to interface-admin for French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052158 (https://phabricator.wikimedia.org/T369113) [19:00:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052158 (https://phabricator.wikimedia.org/T369113) (owner: 10NMW03) [19:03:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [19:03:51] jouncebot: next [19:03:52] In 0 hour(s) and 56 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T2000) [19:04:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [19:06:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [19:06:18] FIRING: 
NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [19:11:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [19:11:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [19:18:32] (03CR) 10Klausman: [C:03+1] "I am saddened by this weird breakage, I congratulate you on figuring it out and wholeheartedly approve this change." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052141 (https://phabricator.wikimedia.org/T368359) (owner: 10Elukey) [19:24:29] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 400.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:45:53] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:46:26] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:55:00] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_eqiad [19:57:29] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:57:33] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_eqiad [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240704T2000) [20:00:05] jan_drewniak, Tchanders, and Nemoralis: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:29] o/ [20:00:32] 0/ [20:00:42] hi o/ [20:02:43] Nemoralis, Tchanders since US is on holiday today, I can do the deploy [20:02:54] thank you [20:03:22] jan_drewniak: thanks [20:03:52] since these are all config changes, I'll do all three patches at once. 
I'll let you both know when it's ready for testing [20:04:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366366) (owner: 10Jdlrobson) [20:04:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050424 (https://phabricator.wikimedia.org/T346022) (owner: 10Dreamy Jazz) [20:04:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052158 (https://phabricator.wikimedia.org/T369113) (owner: 10NMW03) [20:05:14] (03Merged) 10jenkins-bot: [July 4th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366366) (owner: 10Jdlrobson) [20:05:16] (03Merged) 10jenkins-bot: Remove modifications of wgCheckUserLogAdditionalRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050424 (https://phabricator.wikimedia.org/T346022) (owner: 10Dreamy Jazz) [20:05:17] (03Merged) 10jenkins-bot: Add editcontentmodel to interface-admin for French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052158 (https://phabricator.wikimedia.org/T369113) (owner: 10NMW03) [20:05:34] !log jdrewniak@deploy1002 Started scap sync-world: Backport for [[gerrit:1050671|[July 4th] Reduce list of exclusions for dark mode (1.43.0-wmf.12)]], [[gerrit:1050424|Remove modifications of wgCheckUserLogAdditionalRights (T346022)]], [[gerrit:1052158|Add editcontentmodel to interface-admin for French Wikipedia (T369113)]] [20:05:39] T346022: Remove modifications of wgCheckUserLogAdditionalRights in code outside CheckUser - https://phabricator.wikimedia.org/T346022 [20:05:39] T369113: fr.wiki: give the right of editcontentmodel to interface-admin - 
https://phabricator.wikimedia.org/T369113 [20:07:28] (03CR) 10BCornwall: [V:03+1 C:03+2] "Hostnames that have very few requests: The destination hits all show e.g. 1 or 2 hits per second (often 0) except cs.wikipedia.com which h" [puppet] - 10https://gerrit.wikimedia.org/r/1047191 (owner: 10BCornwall) [20:08:52] !log jdrewniak@deploy1002 jdlrobson, nmw03, jdrewniak, dreamyjazz: Backport for [[gerrit:1050671|[July 4th] Reduce list of exclusions for dark mode (1.43.0-wmf.12)]], [[gerrit:1050424|Remove modifications of wgCheckUserLogAdditionalRights (T346022)]], [[gerrit:1052158|Add editcontentmodel to interface-admin for French Wikipedia (T369113)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:09:51] Tchanders, Nemoralis, your patches are on the test servers [20:09:57] LGTM [20:10:15] Thanks, checking.. [20:10:27] https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Liste_des_droits_de_groupe#interface-admin [20:11:55] Looks good [20:12:44] Tchanders, Nemoralis, ok syncing [20:12:48] !log jdrewniak@deploy1002 jdlrobson, nmw03, jdrewniak, dreamyjazz: Continuing with sync [20:13:08] jan_drewniak: thank you! [20:17:48] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1050671|[July 4th] Reduce list of exclusions for dark mode (1.43.0-wmf.12)]], [[gerrit:1050424|Remove modifications of wgCheckUserLogAdditionalRights (T346022)]], [[gerrit:1052158|Add editcontentmodel to interface-admin for French Wikipedia (T369113)]] (duration: 12m 14s) [20:17:52] T346022: Remove modifications of wgCheckUserLogAdditionalRights in code outside CheckUser - https://phabricator.wikimedia.org/T346022 [20:17:52] T369113: fr.wiki: give the right of editcontentmodel to interface-admin - https://phabricator.wikimedia.org/T369113 [20:18:26] jan_drewniak: thank you! [20:19:59] Tchanders, Nemoralis, alrighty, that was much quicker than expected, but patches are synced! [20:20:35] Fastest deploy I've seen in quite a while! 
[20:22:29] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:25:17] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Swift [20:26:15] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Swift [20:31:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [20:31:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [20:36:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [20:36:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [20:47:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:47:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:49:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52338 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:50:01] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:03:05] jouncebot: next [21:03:06] In 8 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240705T0600) [21:03:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [21:07:11] (03PS8) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) [21:07:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [21:07:43] (03CR) 10CI reject: [V:04-1] Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [21:08:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [21:09:20] (03CR) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data (035 comments) [software/homer/deploy] - 
10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [21:12:08] (03PS1) 10Pppery: Missing.php: check REQUEST_URI in addition to PATH_INFO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052165 (https://phabricator.wikimedia.org/T9496) [21:12:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [21:15:33] (03PS9) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) [21:16:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T364069)', diff saved to https://phabricator.wikimedia.org/P65836 and previous config saved to /var/cache/conftool/dbconfig/20240704-211644-marostegui.json [21:16:47] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [21:17:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [21:17:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [21:20:27] FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:22:19] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [21:22:19] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [21:22:49] (03PS1) 10Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control [homer/public] - 10https://gerrit.wikimedia.org/r/1052167 (https://phabricator.wikimedia.org/T339850) [21:25:16] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850#9955217 (10cmooney) [21:31:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P65837 and previous config saved to /var/cache/conftool/dbconfig/20240704-213151-marostegui.json [21:34:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [21:34:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [21:38:37] 10ops-codfw, 06SRE, 06DC-Ops, 10observability, 10SRE Observability 
(FY2023/2024-Q4): titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9955241 (10lmata) [21:38:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 10SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9955243 (10lmata) [21:39:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [21:39:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [21:41:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [21:41:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [21:46:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [21:46:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [21:46:59] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P65838 and previous config saved to /var/cache/conftool/dbconfig/20240704-214658-marostegui.json [21:52:43] (03PS1) 10Mvolz: Revert^2 "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052176 [21:53:06] (03CR) 10Mvolz: [C:03+2] Revert^2 "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052176 (owner: 10Mvolz) [21:53:58] (03Merged) 10jenkins-bot: Revert^2 "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052176 (owner: 10Mvolz) [21:59:32] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [21:59:55] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [22:00:41] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [22:01:21] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [22:02:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T364069)', diff saved to https://phabricator.wikimedia.org/P65839 and previous config saved to /var/cache/conftool/dbconfig/20240704-220205-marostegui.json [22:02:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [22:02:09] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:02:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [22:02:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T364069)', diff saved to https://phabricator.wikimedia.org/P65840 and previous config saved to /var/cache/conftool/dbconfig/20240704-220227-marostegui.json [22:03:47] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply 
[22:04:16] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [22:15:10] (03PS1) 10Urbanecm: stewards: Add Phabricator API configuration [puppet] - 10https://gerrit.wikimedia.org/r/1052185 (https://phabricator.wikimedia.org/T369322) [22:16:11] (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052185 (https://phabricator.wikimedia.org/T369322) (owner: 10Urbanecm) [22:19:20] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9955306 (10AndyRussG) hi @Volans, @KFrancis! I received the NDA at my WMDE address and signed it. I also updated my e-mail address on `https:/... [22:56:56] (03PS1) 10Urbanecm: lists::automation: Update stewards-l in real mode [puppet] - 10https://gerrit.wikimedia.org/r/1052188 (https://phabricator.wikimedia.org/T351202) [23:07:19] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [23:07:20] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [23:09:40] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9955320 (10Urbanecm) 05Stalled→03... 
[23:12:19] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [23:12:20] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [23:27:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [23:27:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [23:32:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [23:32:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [23:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest 
[core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1052191 [23:38:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1052191 (owner: 10TrainBranchBot)