[00:00:25] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:41] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:09] <wikibugs>	 (PS1) Cwhite: Remove non-kafka logstash nodes from kafka configs [deployment-charts] - https://gerrit.wikimedia.org/r/886862 (https://phabricator.wikimedia.org/T329142)
[00:09:53] <icinga-wm>	 PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2023-01-31 00:00:13 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:12:08] <wikibugs>	 (PS1) Cwhite: logstash: enable error.stack.previous_trace [puppet] - https://gerrit.wikimedia.org/r/886863 (https://phabricator.wikimedia.org/T314098)
[00:13:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T328817)', diff saved to https://phabricator.wikimedia.org/P43919 and previous config saved to /var/cache/conftool/dbconfig/20230209-001340-marostegui.json
[00:13:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[00:13:44] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[00:13:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[00:14:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T328817)', diff saved to https://phabricator.wikimedia.org/P43920 and previous config saved to /var/cache/conftool/dbconfig/20230209-001401-marostegui.json
[00:14:17] <wikibugs>	 (CR) CI reject: [V: -1] logstash: enable error.stack.previous_trace [puppet] - https://gerrit.wikimedia.org/r/886863 (https://phabricator.wikimedia.org/T314098) (owner: Cwhite)
[00:16:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T328817)', diff saved to https://phabricator.wikimedia.org/P43921 and previous config saved to /var/cache/conftool/dbconfig/20230209-001613-marostegui.json
[00:17:39] <wikibugs>	 (PS2) Cwhite: logstash: enable error.stack.previous_trace [puppet] - https://gerrit.wikimedia.org/r/886863 (https://phabricator.wikimedia.org/T314098)
[00:18:41] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Backy2 backup jobs: don't email on failure [puppet] - https://gerrit.wikimedia.org/r/886470 (https://phabricator.wikimedia.org/T328868) (owner: Andrew Bogott)
[00:19:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43922 and previous config saved to /var/cache/conftool/dbconfig/20230209-001910-ladsgroup.json
[00:19:13] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[00:22:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:22:14] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2423.codfw.wmnet with OS buster
[00:22:20] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2423.codfw.wmnet with OS buster completed: - mw2423 (**PASS**)   - Removed from Pupp...
[00:24:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2424.codfw.wmnet with OS buster
[00:24:57] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2424.codfw.wmnet with OS buster
[00:27:25] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[00:31:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P43923 and previous config saved to /var/cache/conftool/dbconfig/20230209-003119-marostegui.json
[00:34:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P43924 and previous config saved to /var/cache/conftool/dbconfig/20230209-003416-ladsgroup.json
[00:40:39] <icinga-wm>	 RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2023-02-07 15:56:09 (4056 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:41:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2425.codfw.wmnet with OS buster
[00:41:12] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2425.codfw.wmnet with OS buster
[00:46:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P43925 and previous config saved to /var/cache/conftool/dbconfig/20230209-004625-marostegui.json
[00:49:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P43926 and previous config saved to /var/cache/conftool/dbconfig/20230209-004923-ladsgroup.json
[00:50:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2424.codfw.wmnet with reason: host reimage
[00:53:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2424.codfw.wmnet with reason: host reimage
[01:00:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2425.codfw.wmnet with reason: host reimage
[01:01:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T328817)', diff saved to https://phabricator.wikimedia.org/P43927 and previous config saved to /var/cache/conftool/dbconfig/20230209-010132-marostegui.json
[01:01:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[01:01:36] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[01:01:47] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[01:03:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2425.codfw.wmnet with reason: host reimage
[01:04:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43928 and previous config saved to /var/cache/conftool/dbconfig/20230209-010429-ladsgroup.json
[01:04:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
[01:04:33] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[01:04:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
[01:04:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43929 and previous config saved to /var/cache/conftool/dbconfig/20230209-010450-ladsgroup.json
[01:09:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:17:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:22:34] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:22:35] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2425.codfw.wmnet with OS buster
[01:22:38] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:22:39] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2424.codfw.wmnet with OS buster
[01:22:42] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2425.codfw.wmnet with OS buster completed: - mw2425 (**PASS**)   - Removed from Pupp...
[01:22:45] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2424.codfw.wmnet with OS buster completed: - mw2424 (**PASS**)   - Removed from Pupp...
[01:23:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2426.codfw.wmnet with OS buster
[01:23:19] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2426.codfw.wmnet with OS buster
[01:27:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2427.codfw.wmnet with OS buster
[01:28:01] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2427.codfw.wmnet with OS buster
[01:36:36] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[01:42:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2426.codfw.wmnet with reason: host reimage
[01:45:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2426.codfw.wmnet with reason: host reimage
[01:47:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2427.codfw.wmnet with reason: host reimage
[01:47:54] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[01:48:58] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[01:50:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2427.codfw.wmnet with reason: host reimage
[02:00:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:04:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43930 and previous config saved to /var/cache/conftool/dbconfig/20230209-020401-ladsgroup.json
[02:04:04] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[02:04:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:10:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:10:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2426.codfw.wmnet with OS buster
[02:10:13] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2426.codfw.wmnet with OS buster completed: - mw2426 (**PASS**)   - Removed from Pupp...
[02:10:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:11:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2427.codfw.wmnet with OS buster
[02:11:13] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2427.codfw.wmnet with OS buster completed: - mw2427 (**PASS**)   - Removed from Pupp...
[02:11:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2428.codfw.wmnet with OS buster
[02:11:36] <musikanimal>	 could I get a deployer to run a quick and harmless maintenance script on zhwiki for me? or should that go through a backport window? (there's no patch)
[02:11:37] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2428.codfw.wmnet with OS buster
[02:11:58] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[02:17:04] <TheresNoTime>	 musikanimal: depends on the script :D
[02:18:42] <musikanimal>	 TheresNoTime: there was a botched deploy of PageAssessments to zhwiki
[02:18:50] <musikanimal>	 https://phabricator.wikimedia.org/T328224 for context
[02:19:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43931 and previous config saved to /var/cache/conftool/dbconfig/20230209-021907-ladsgroup.json
[02:19:30] <TheresNoTime>	 looking..
[02:20:09] <musikanimal>	 we need the purgeUnusedProjects.php maintenance script to be run
[02:20:46] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:54] <musikanimal>	 I removed the page assessments, and that table is now (almost) empty... but anyway the page_assessments_projects are the corrupt data, and running purgeUnusedProjects.php should clear those out
[02:21:12] <TheresNoTime>	 okay, one moment :)
[02:21:51] <musikanimal>	 I have prod db access FYI, just read-only. `SELECT COUNT(*) FROM page_assessments_projects` reports 479, that should be zero (or very close to zero)
[02:23:37] <TheresNoTime>	 !log `[samtar@mwmaint1002 ~]$ mwscript extensions/PageAssessments/maintenance/purgeUnusedProjects.php --wiki zhwiki --dry-run` for T326387
[02:23:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:40] <stashbot>	 T326387: Deploy PageAssessments to Chinese Wikipedia - https://phabricator.wikimedia.org/T326387
[02:24:36] <TheresNoTime>	 musikanimal: does https://phabricator.wikimedia.org/P43932 look reasonable?
[02:25:11] <musikanimal>	 yep! though I wonder if you're able to TRUNCATE `page_assessments` first? then it would actually be zero
[02:25:43] <musikanimal>	 apparently there's a bug in PageAssessments where pages moved without redirect leave the assessments behind, so there's 7 rows in `page_assessments` for non-existent pages, and so the maintenance script thinks those WikiProjects are being used
[02:26:21] <musikanimal>	 doing direct db writes/deletes etc seems scary so no worries if you don't want to or can't
[02:26:32] <musikanimal>	 there will only be a few rows of bad data, no big deal :)
[02:26:54] <TheresNoTime>	 musikanimal: okay, I am `sql zhwiki`, and I am going to run `TRUNCATE TABLE page_assessments;`, correct?
[02:27:04] <Reedy>	 It won't work like that
[02:27:11] <Reedy>	 You're connected to a replica
[02:27:21] <TheresNoTime>	 ah
[02:27:40] <TheresNoTime>	 then I'm going to run the maintenance script and we can worry about those 7 later, sound okay musikanimal?
[02:27:53] <Reedy>	 you can do it via connecting to the master ;P
[02:28:29] <TheresNoTime>	 !log `[samtar@mwmaint1002 ~]$ mwscript extensions/PageAssessments/maintenance/purgeUnusedProjects.php --wiki zhwiki` for T326387
[02:28:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:04] <musikanimal>	 that's fine, I know which rows should be removed and we can remove them later (or not)
[02:30:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2428.codfw.wmnet with reason: host reimage
[02:30:46] <musikanimal>	 okay yeah, those 7 leftover rows in page_assessments_projects are not task forces (sub-WikiProjects), so it doesn't matter anyway. We're all set!
[02:30:48] <musikanimal>	 thank you!!
[02:30:54] <TheresNoTime>	 No worries :)
[02:31:07] <musikanimal>	 now I can add back the parser function then things should be stored correctly
[02:33:05] <TheresNoTime>	 Reedy: running a `TRUNCATE` is scary enough tyvm :p
[02:33:46] <Reedy>	 if there's only 7 rows, you could do delete from table where id in [ list ];
[02:33:53] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2428.codfw.wmnet with reason: host reimage
[02:34:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43933 and previous config saved to /var/cache/conftool/dbconfig/20230209-023413-ladsgroup.json
[02:34:29] <musikanimal>	 lol, I agree TRUNCATE is scarrrry!
[02:36:13] <wikibugs>	 (PS1) Raymond Ndibe: puppet: adapt replica_cnf_api to python3.5 [puppet] - https://gerrit.wikimedia.org/r/887872 (https://phabricator.wikimedia.org/T304040)
[02:36:58] <musikanimal>	 so interesting thing I've been wondering about... the API says there are zero jobs in the queue on zhwiki, but for sure there's about 800K+ that just got fired off after the template I just edited
[02:37:20] <musikanimal>	 why is that, and where should I go to see the actual number of pending jobs?
[02:37:24] <Reedy>	 I think general advice is to ignore what the API says for the job queue size
[02:38:51] <musikanimal>	 haha ok. I guess I can query `job` directly
[02:40:04] <Reedy>	 https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue#Monitoring
[02:40:28] <TheresNoTime>	 https://logstash.wikimedia.org/goto/ccd5e2517591489ad88bc66922f8311c being a dead link, *chef kiss*
[02:40:50] <Reedy>	 heh
[02:40:51] <musikanimal>	 yeah
[02:41:01] <Reedy>	 !bug 1
[02:41:02] <wm-bot>	 https://bugzilla.wikimedia.org/show_bug.cgi?id=1
[02:41:11] <musikanimal>	 well they're not in the `job` table, I just queried and that is in fact 0 rows
[02:41:39] <musikanimal>	 haha!! nice one Reedy, I'm going to have to remember that
[02:41:40] <Reedy>	 yeah, WMF production hasn't used the job table in a looong time
[02:41:56] <musikanimal>	 I'm guessing that's what the API is reporting?
[02:42:08] <Reedy>	 I think it can report *some* other sources
[02:43:19] <TheresNoTime>	 I *think* it's meant to be https://logstash.wikimedia.org/goto/d6d10e8e40672fcca72e3e556b7af954 ?
[02:45:32] <Reedy>	 https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue and https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1
[02:45:41] <Reedy>	 It's hard to know what old links are actually supposed to be pointing to
[02:48:08] <TheresNoTime>	 well I fixed the logstash link in https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue#Logs (but left the old one in a <!-- comment --> just in case..)
[02:48:22] <Reedy>	 TheresNoTime: If only the pages had history... :P
[02:48:49] * TheresNoTime mutters
[02:49:05] <wikibugs>	 (CR) Raymond Ndibe: "all tests passing both unit tests and functional tests on dbusers-nfs-1.testlabs.eqiad1.wikimedia.cloud" [puppet] - https://gerrit.wikimedia.org/r/887872 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[02:49:12] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:49:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43934 and previous config saved to /var/cache/conftool/dbconfig/20230209-024920-ladsgroup.json
[02:49:23] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[02:50:26] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:54:36] <wikibugs>	 (CR) Raymond Ndibe: puppet: modify role::wmcs::nfs::primary for replica_cnf api (1 comment) [puppet] - https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe)
[02:56:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:56:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2428.codfw.wmnet with OS buster
[02:56:55] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2428.codfw.wmnet with OS buster completed: - mw2428 (**PASS**)   - Removed from Pupp...
[03:12:48] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[03:47:19] <wikibugs>	 SRE, Commons, MediaWiki-File-management, StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (Samwilson) Related Community Wishlist Survey proposal: [[https://meta.wikimedia.org/wiki...
[04:44:14] <wikibugs>	 (PS1) KartikMistry: CX: Provide the appropriate arguments to ve.ui.CXSurface constructor [extensions/ContentTranslation] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887847 (https://phabricator.wikimedia.org/T329154)
[05:10:25] <wikibugs>	 (CR) Legoktm: "Overall looks fine, my two comments aren't blockers, just suggestions." [mediawiki-config] - https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: Urbanecm)
[05:10:42] <legoktm>	 urbanecm: hope that helps
[05:56:03] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:56:07] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:57:07] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:01:17] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:01:19] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:02:21] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:32:11] <marostegui>	 I am switching over phabricator master in 30 minutes, meaning 1 minute of read only time
[06:38:29] <wikibugs>	 (PS2) Marostegui: mariadb: Promote db1159 to m3 mater [puppet] - https://gerrit.wikimedia.org/r/887727 (https://phabricator.wikimedia.org/T329141)
[06:40:50] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1159 to m3 mater [puppet] - https://gerrit.wikimedia.org/r/887727 (https://phabricator.wikimedia.org/T329141) (owner: Marostegui)
[06:48:09] <logmsgbot>	 !log oblivian@cumin2002 START - Cookbook sre.discovery.datacenter status all services in eqiad: maintenance
[06:48:17] <logmsgbot>	 !log oblivian@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in eqiad: maintenance
[06:54:44] <wikibugs>	 (CR) Giuseppe Lavagetto: sre.discovery.datacenter: rename and add status command (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/887740 (owner: Giuseppe Lavagetto)
[06:55:35] <wikibugs>	 (PS4) Giuseppe Lavagetto: sre.discovery.datacenter: rename and add status command [cookbooks] - https://gerrit.wikimedia.org/r/887740
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T0700)
[07:00:04] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T0700).
[07:00:06] <marostegui>	 !log Failover m3 from db1164 to db1159 - T329141
[07:00:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:09] <stashbot>	 T329141: Switchover m3 master db1164 -> db1159 - https://phabricator.wikimedia.org/T329141
[07:02:47] <wikibugs>	 SRE, DBA, Data-Engineering, Data-Persistence, and 9 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui)
[07:03:29] <wikibugs>	 (PS1) Marostegui: db1164: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/887877 (https://phabricator.wikimedia.org/T329143)
[07:04:02] <wikibugs>	 (CR) Marostegui: [C: +2] db1164: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/887877 (https://phabricator.wikimedia.org/T329143) (owner: Marostegui)
[07:04:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance
[07:04:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance
[07:09:07] <wikibugs>	 (PS1) Marostegui: db1098: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/887878 (https://phabricator.wikimedia.org/T329171)
[07:09:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[07:09:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[07:09:48] <wikibugs>	 (CR) Marostegui: [C: +2] db1098: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/887878 (https://phabricator.wikimedia.org/T329171) (owner: Marostegui)
[07:10:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1098 (s6, s7) from dbctl T329171', diff saved to https://phabricator.wikimedia.org/P43935 and previous config saved to /var/cache/conftool/dbconfig/20230209-071013-marostegui.json
[07:10:17] <stashbot>	 T329171: decommission db1098.eqiad.wmnet - https://phabricator.wikimedia.org/T329171
[07:18:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance
[07:19:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance
[07:19:49] <wikibugs>	 (PS1) Marostegui: db1098: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/887879 (https://phabricator.wikimedia.org/T329171)
[07:20:59] <wikibugs>	 ops-codfw, DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (Marostegui) >>! In T328623#8594346, @Jhancock.wm wrote: > We did some more troubleshooting and it looks like the slot for DIMM_B4 is bad. This may need a MB replacement to fully fix.   Thanks - just let me know whe...
[07:21:11] <wikibugs>	 (CR) Marostegui: [C: +2] db1098: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/887879 (https://phabricator.wikimedia.org/T329171) (owner: Marostegui)
[07:21:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance
[07:21:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance
[07:22:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T328817)', diff saved to https://phabricator.wikimedia.org/P43936 and previous config saved to /var/cache/conftool/dbconfig/20230209-072204-marostegui.json
[07:22:07] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[07:23:55] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1164 to m1 [puppet] - https://gerrit.wikimedia.org/r/887880 (https://phabricator.wikimedia.org/T329143)
[07:24:40] <marostegui>	 !log Stop mariadb on db1117:3321 (some dbproxy irc alerts will be triggered) T329143
[07:24:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:43] <stashbot>	 T329143: Move db1164 to m1 - https://phabricator.wikimedia.org/T329143
[07:25:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T328817)', diff saved to https://phabricator.wikimedia.org/P43938 and previous config saved to /var/cache/conftool/dbconfig/20230209-072535-marostegui.json
[07:26:46] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/887880 (https://phabricator.wikimedia.org/T329143) (owner: 10Marostegui)
[07:33:41] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:40:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P43939 and previous config saved to /var/cache/conftool/dbconfig/20230209-074042-marostegui.json
[07:45:17] <marostegui>	 Again, dbproxy irc alerts are expected
[07:48:28] <wikibugs>	 (03CR) 10Elukey: "Ben I left a comment about a setting that may cause a runtime error from puppet, lemme know what you think. My knowledge about profile::ha" [puppet] - 10https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) (owner: 10Btullis)
[07:52:11] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:55:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P43940 and previous config saved to /var/cache/conftool/dbconfig/20230209-075548-marostegui.json
[08:00:04] <jouncebot>	 Amir1, apergos, and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T0800).
[08:00:04] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:28] <apergos>	 morning!  woops, forgot to check the deployments calendar, give me one sec
[08:00:46] * kart_ is here
[08:00:56] <apergos>	 no trainees signed up for the slot
[08:01:12] <apergos>	 kart_: care to self-deploy? I know you're usually good for it
[08:01:12] <kart_>	 apergos: I guess, I can go ahead..
[08:01:25] <kart_>	 apergos: yes :)
[08:01:45] <apergos>	 okey dokey, it's all you!
[08:02:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887847 (https://phabricator.wikimedia.org/T329154) (owner: 10KartikMistry)
[08:06:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/887743 (owner: 10Elukey)
[08:09:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[08:10:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T328817)', diff saved to https://phabricator.wikimedia.org/P43941 and previous config saved to /var/cache/conftool/dbconfig/20230209-081054-marostegui.json
[08:10:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance
[08:10:58] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[08:11:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance
[08:11:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T328817)', diff saved to https://phabricator.wikimedia.org/P43942 and previous config saved to /var/cache/conftool/dbconfig/20230209-081116-marostegui.json
[08:14:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T328817)', diff saved to https://phabricator.wikimedia.org/P43943 and previous config saved to /var/cache/conftool/dbconfig/20230209-081433-marostegui.json
[08:17:10] <wikibugs>	 (03CR) 10Muehlenhoff: add SPDX license headers to various roles I was involved in writing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887382 (owner: 10Dzahn)
[08:18:35] <wikibugs>	 (03Merged) 10jenkins-bot: CX: Provide the appropriate arguments to ve.ui.CXSurface constructor [extensions/ContentTranslation] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887847 (https://phabricator.wikimedia.org/T329154) (owner: 10KartikMistry)
[08:19:04] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:887847|CX: Provide the appropriate arguments to ve.ui.CXSurface constructor (T329154)]]
[08:19:07] <stashbot>	 T329154: Content Translation is broken in test wiki - https://phabricator.wikimedia.org/T329154
[08:20:15] <wikibugs>	 (03PS24) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767)
[08:20:23] <wikibugs>	 (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[08:20:59] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:887847|CX: Provide the appropriate arguments to ve.ui.CXSurface constructor (T329154)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[08:24:05] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:24:15] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:25:45] <wikibugs>	 10ops-codfw, 10DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (10Marostegui)
[08:29:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P43944 and previous config saved to /var/cache/conftool/dbconfig/20230209-082940-marostegui.json
[08:29:47] <wikibugs>	 (03PS1) 10Vgutierrez: cp4044: Enable ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/887882 (https://phabricator.wikimedia.org/T308799)
[08:31:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/886331 (owner: 10Slyngshede)
[08:32:09] <apergos>	 how's it looking?
[08:32:14] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:887847|CX: Provide the appropriate arguments to ve.ui.CXSurface constructor (T329154)]] (duration: 13m 10s)
[08:32:17] <stashbot>	 T329154: Content Translation is broken in test wiki - https://phabricator.wikimedia.org/T329154
[08:34:57] <dcausse>	  /buffer 6
[08:41:06] <apergos>	 kart_: ?  how are things?
[08:41:33] <kart_>	 apergos: all done. Sorry for delay.
[08:41:46] <apergos>	 ok! no worries, that's the window for today then
[08:42:10] <apergos>	 !log UTC morning backport and config training window complete
[08:42:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:30] <apergos>	 see everyone here again next time!
[08:44:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P43945 and previous config saved to /var/cache/conftool/dbconfig/20230209-084446-marostegui.json
[08:46:09] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:40] <wikibugs>	 (03PS1) 10Marostegui: add_cuc_only_for_read_old_T329203.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887884 (https://phabricator.wikimedia.org/T329203)
[08:55:51] <kostajh>	 apergos: can I backport something now?
[08:55:54] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/887843 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney)
[08:56:00] <kostajh>	 I can also wait until afternoon window
[08:56:47] <zabe>	 jouncebot, next
[08:56:48] <jouncebot>	 In 2 hour(s) and 3 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100)
[08:56:48] <jouncebot>	 In 2 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100)
[08:57:24] <zabe>	 I am not them, but since there is nothing after this window for 2 hours you could just deploy somewhere in that time period
[08:57:53] <vgutierrez>	 !log depool cp4044 - T308799
[08:57:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:57] <stashbot>	 T308799: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799
[08:58:24] <kostajh>	 zabe: ok, I'll get started then
[08:58:48] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] cp4044: Enable ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/887882 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[08:58:58] <wikibugs>	 (03PS1) 10Marostegui: monitoring.yaml: Replace m1 master [puppet] - 10https://gerrit.wikimedia.org/r/887885 (https://phabricator.wikimedia.org/T329259)
[08:59:11] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/887885 (https://phabricator.wikimedia.org/T329259) (owner: 10Marostegui)
[08:59:14] <wikibugs>	 (03CR) 10Vgutierrez: cp4044: Enable ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/887882 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[08:59:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T328817)', diff saved to https://phabricator.wikimedia.org/P43946 and previous config saved to /var/cache/conftool/dbconfig/20230209-085952-marostegui.json
[08:59:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance
[08:59:57] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[09:00:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance
[09:00:10] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[09:00:12] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[09:00:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T328817)', diff saved to https://phabricator.wikimedia.org/P43947 and previous config saved to /var/cache/conftool/dbconfig/20230209-090018-marostegui.json
[09:00:52] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39477/console" [puppet] - 10https://gerrit.wikimedia.org/r/887882 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[09:01:53] <wikibugs>	 (03PS1) 10Kosta Harlan: ComputedUserImpactLookup: Reduce logspam for page view rate limiting [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945)
[09:02:12] <wikibugs>	 (03PS1) 10Kosta Harlan: Add StatusValue::hasMessagesExcept() [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887849 (https://phabricator.wikimedia.org/T272081)
[09:02:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ComputedUserImpactLookup: Reduce logspam for page view rate limiting [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945) (owner: 10Kosta Harlan)
[09:02:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T328817)', diff saved to https://phabricator.wikimedia.org/P43948 and previous config saved to /var/cache/conftool/dbconfig/20230209-090236-marostegui.json
[09:02:46] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cp4044: Enable ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/887882 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[09:03:01] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto)
[09:04:08] <wikibugs>	 (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945) (owner: 10Kosta Harlan)
[09:04:12] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/887806 (https://phabricator.wikimedia.org/T329175) (owner: 10Giuseppe Lavagetto)
[09:04:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887849 (https://phabricator.wikimedia.org/T272081) (owner: 10Kosta Harlan)
[09:05:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update cloud proxies [puppet] - 10https://gerrit.wikimedia.org/r/887798 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[09:07:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Arclamp [puppet] - 10https://gerrit.wikimedia.org/r/887769 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:07:36] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "On the devtools project, I have rebooted our testing Phabricator instance phabricator-prod-1001 and confirmed phd failed to start." [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[09:08:34] <wikibugs>	 (03PS10) 10Hashar: phabricator: create phd home directory on service start [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146)
[09:08:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 db1132', diff saved to https://phabricator.wikimedia.org/P43949 and previous config saved to /var/cache/conftool/dbconfig/20230209-090846-root.json
[09:09:22] <vgutierrez>	 !log pool cp4044 with ESI testing enabled
[09:09:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:34] <marostegui>	 !log Install 10.6.12 on db1132 T329011
[09:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:37] <stashbot>	 T329011: Compile and package MariaDB 10.4.28 and 10.6.12 - https://phabricator.wikimedia.org/T329011
[09:10:42] <marostegui>	 !log Install 10.4.28 on db1107 T329011
[09:10:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P43950 and previous config saved to /var/cache/conftool/dbconfig/20230209-091145-root.json
[09:11:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P43951 and previous config saved to /var/cache/conftool/dbconfig/20230209-091149-root.json
[09:13:28] <moritzm>	 !log installing openssl security updates on Bullseye
[09:13:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P43952 and previous config saved to /var/cache/conftool/dbconfig/20230209-091742-marostegui.json
[09:18:50] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Replace m1 master [puppet] - 10https://gerrit.wikimedia.org/r/887885 (https://phabricator.wikimedia.org/T329259) (owner: 10Marostegui)
[09:20:04] <wikibugs>	 (03Merged) 10jenkins-bot: Add StatusValue::hasMessagesExcept() [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887849 (https://phabricator.wikimedia.org/T272081) (owner: 10Kosta Harlan)
[09:20:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Upgrade plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/886861 (https://phabricator.wikimedia.org/T317887) (owner: 10Cwhite)
[09:20:31] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:887849|Add StatusValue::hasMessagesExcept() (T272081)]]
[09:20:35] <stashbot>	 T272081: Introduce StatusValue::ignore method - https://phabricator.wikimedia.org/T272081
[09:20:35] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I made the RuntimeDirectory relative. Applied the patch on the puppetmaster, rebooted the instance and this time it works with `/var/run/p" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[09:22:23] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Backport for [[gerrit:887849|Add StatusValue::hasMessagesExcept() (T272081)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[09:22:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "A bit more of tech-debt but Good Enough™, and probably better than forcing statsd.eqiad.wmnet to v4" [puppet] - 10https://gerrit.wikimedia.org/r/887804 (owner: 10Herron)
[09:24:47] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "cp4044: Enable ESI testing" [puppet] - 10https://gerrit.wikimedia.org/r/887850
[09:25:13] <wikibugs>	 (03PS2) 10Vgutierrez: Revert "cp4044: Enable ESI testing" [puppet] - 10https://gerrit.wikimedia.org/r/887850 (https://phabricator.wikimedia.org/T308799)
[09:25:49] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Revert "cp4044: Enable ESI testing" [puppet] - 10https://gerrit.wikimedia.org/r/887850 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[09:26:10] <wikibugs>	 (03PS2) 10Filippo Giunchedi: opensearch_dashboards: enforce memory limit [puppet] - 10https://gerrit.wikimedia.org/r/887767 (https://phabricator.wikimedia.org/T327161)
[09:26:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/887767 (https://phabricator.wikimedia.org/T327161) (owner: 10Filippo Giunchedi)
[09:26:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43953 and previous config saved to /var/cache/conftool/dbconfig/20230209-092650-root.json
[09:26:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43954 and previous config saved to /var/cache/conftool/dbconfig/20230209-092654-root.json
[09:27:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] opensearch_dashboards: enforce memory limit [puppet] - 10https://gerrit.wikimedia.org/r/887767 (https://phabricator.wikimedia.org/T327161) (owner: 10Filippo Giunchedi)
[09:27:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] cumin: add more aliases for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/887743 (owner: 10Elukey)
[09:27:15] <wikibugs>	 (03PS6) 10Elukey: cumin: add more aliases for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/887743
[09:27:20] <wikibugs>	 (03CR) 10Elukey: [V: 03+2] cumin: add more aliases for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/887743 (owner: 10Elukey)
[09:28:51] <wikibugs>	 (03PS1) 10Marostegui: control-mariadb-client-10.4: Update version [software] - 10https://gerrit.wikimedia.org/r/887935 (https://phabricator.wikimedia.org/T329011)
[09:29:52] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:887849|Add StatusValue::hasMessagesExcept() (T272081)]] (duration: 09m 20s)
[09:29:55] <stashbot>	 T272081: Introduce StatusValue::ignore method - https://phabricator.wikimedia.org/T272081
[09:30:08] <wikibugs>	 10SRE, 10Data-Persistence, 10Discovery-Search, 10serviceops, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Clement_Goubert)
[09:31:29] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-10.4: Update version [software] - 10https://gerrit.wikimedia.org/r/887935 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui)
[09:31:39] <kostajh>	 on to the next one
[09:32:20] <wikibugs>	 (03Merged) 10jenkins-bot: control-mariadb-client-10.4: Update version [software] - 10https://gerrit.wikimedia.org/r/887935 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui)
[09:32:31] <godog>	 !log roll-restart opensearch-dashboards to apply memory limit - T327161
[09:32:32] <wikibugs>	 (03PS2) 10Kosta Harlan: ComputedUserImpactLookup: Reduce logspam for page view rate limiting [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945)
[09:32:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:34] <stashbot>	 T327161: opensearch OOM on logstash102[34] - https://phabricator.wikimedia.org/T327161
[09:32:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945) (owner: 10Kosta Harlan)
[09:32:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P43955 and previous config saved to /var/cache/conftool/dbconfig/20230209-093248-marostegui.json
[09:32:58] <kostajh>	 I had to remove the Depends-On in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/887848/ for the patch to work with scap backport
[09:34:51] <wikibugs>	 (03CR) 10Elukey: "Testing the code on cumin1001 with Dry-run, fixing little bugs and then report back." [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[09:36:47] <wikibugs>	 (03PS3) 10FNegri: Add support for cloud test env (codfw) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797
[09:37:08] <wikibugs>	 (03PS1) 10Marostegui: control-mariadb-10.6-bullseye: Update to 10.6.12 [software] - 10https://gerrit.wikimedia.org/r/887936 (https://phabricator.wikimedia.org/T329011)
[09:37:46] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bullseye: Update to 10.6.12 [software] - 10https://gerrit.wikimedia.org/r/887936 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui)
[09:38:18] <wikibugs>	 (03Merged) 10jenkins-bot: control-mariadb-10.6-bullseye: Update to 10.6.12 [software] - 10https://gerrit.wikimedia.org/r/887936 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui)
[09:40:14] <wikibugs>	 (03PS1) 10Filippo Giunchedi: admin: move kwakuofori to ops [puppet] - 10https://gerrit.wikimedia.org/r/887937 (https://phabricator.wikimedia.org/T328787)
[09:41:19] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Clement_Goubert) I'll let @LSobanski answer authoritatively for Phabricator and Etherpad. We are not switching over...
[09:41:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43956 and previous config saved to /var/cache/conftool/dbconfig/20230209-094154-root.json
[09:42:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43957 and previous config saved to /var/cache/conftool/dbconfig/20230209-094159-root.json
[09:42:35] <wikibugs>	 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Traffic, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10nfraison) False alert has still been reported today in (Var...
[09:43:28] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto)
[09:43:41] <wikibugs>	 (03PS4) 10Clément Goubert: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774
[09:45:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM (if the key has been validated via some out-of-band channel)" [puppet] - 10https://gerrit.wikimedia.org/r/887937 (https://phabricator.wikimedia.org/T328787) (owner: 10Filippo Giunchedi)
[09:47:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T328817)', diff saved to https://phabricator.wikimedia.org/P43958 and previous config saved to /var/cache/conftool/dbconfig/20230209-094755-marostegui.json
[09:47:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[09:47:59] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[09:48:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix cloudvirt-codfw1dev Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/887939
[09:48:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[09:48:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P43959 and previous config saved to /var/cache/conftool/dbconfig/20230209-094816-marostegui.json
[09:48:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the quick review, I'll validate the key with Kwaku later today and then merge" [puppet] - 10https://gerrit.wikimedia.org/r/887937 (https://phabricator.wikimedia.org/T328787) (owner: 10Filippo Giunchedi)
[09:49:36] <wikibugs>	 (03Merged) 10jenkins-bot: ComputedUserImpactLookup: Reduce logspam for page view rate limiting [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945) (owner: 10Kosta Harlan)
[09:49:59] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:887848|ComputedUserImpactLookup: Reduce logspam for page view rate limiting (T328945)]]
[09:50:03] <stashbot>	 T328945: An earlier attempt to fetch page {page title} failed. To limit server load, retries have been blocked for 30 minutes. - https://phabricator.wikimedia.org/T328945
[09:50:10] <wikibugs>	 (03PS5) 10Clément Goubert: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774
[09:51:38] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] sre.discovery.datacenter: fix rollback logic [cookbooks] - 10https://gerrit.wikimedia.org/r/887806 (https://phabricator.wikimedia.org/T329175) (owner: 10Giuseppe Lavagetto)
[09:51:50] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Backport for [[gerrit:887848|ComputedUserImpactLookup: Reduce logspam for page view rate limiting (T328945)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[09:51:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P43960 and previous config saved to /var/cache/conftool/dbconfig/20230209-095153-marostegui.json
[09:53:39] <wikibugs>	 (03PS5) 10Clément Goubert: sre.discovery.datacenter: rename and add status command [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto)
[09:53:41] <wikibugs>	 (03PS6) 10Clément Goubert: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774
[09:54:57] <wikibugs>	 (03PS25) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767)
[09:56:15] <wikibugs>	 (03PS4) 10FNegri: Add support for cloud test env (codfw) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797
[09:56:18] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "This was a hard one because of the diff 😊" [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond)
[09:57:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43961 and previous config saved to /var/cache/conftool/dbconfig/20230209-095659-root.json
[09:57:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43962 and previous config saved to /var/cache/conftool/dbconfig/20230209-095704-root.json
[09:58:06] <wikibugs>	 (03CR) 10Elukey: "Ready for another review :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[09:59:06] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:887848|ComputedUserImpactLookup: Reduce logspam for page view rate limiting (T328945)]] (duration: 09m 06s)
[09:59:09] <stashbot>	 T328945: An earlier attempt to fetch page {page title} failed. To limit server load, retries have been blocked for 30 minutes. - https://phabricator.wikimedia.org/T328945
[09:59:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix cloudvirt-codfw1dev Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/887939 (owner: 10Muehlenhoff)
[10:01:34] <kostajh>	 !log UTC morning deploys really done
[10:01:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:07] <wikibugs>	 (03PS6) 10Clément Goubert: sre.discovery.datacenter: rename and add status command [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto)
[10:03:41] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "Reverted to the state of PS4 after a git-review mishap. Still lgtm." [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto)
[10:03:47] <wikibugs>	 (03CR) 10Jelto: jenkins: fix directory and restrict sudo rules to jenkins jars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto)
[10:05:19] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good to me, one suggestion inline related to the fingerprint validation" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri)
[10:06:20] <wikibugs>	 10SRE, 10Observability-Alerting, 10observability: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869 (10fgiunchedi) AFAICS we can't customize/template the URL karma builds for that link
[10:07:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P43963 and previous config saved to /var/cache/conftool/dbconfig/20230209-100700-marostegui.json
[10:07:17] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Update analytics data purge for webrequest_actor [puppet] - 10https://gerrit.wikimedia.org/r/887786 (https://phabricator.wikimedia.org/T324483) (owner: 10Joal)
[10:10:17] <icinga-wm>	 PROBLEM - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@search.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43964 and previous config saved to /var/cache/conftool/dbconfig/20230209-101204-root.json
[10:12:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove installserver role from install1003 [puppet] - 10https://gerrit.wikimedia.org/r/887941 (https://phabricator.wikimedia.org/T327867)
[10:12:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43965 and previous config saved to /var/cache/conftool/dbconfig/20230209-101209-root.json
[10:14:45] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway)
[10:21:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove installserver role from install1003 [puppet] - 10https://gerrit.wikimedia.org/r/887941 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[10:22:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P43966 and previous config saved to /var/cache/conftool/dbconfig/20230209-102206-marostegui.json
[10:25:06] <wikibugs>	 (03CR) 10Nicolas Fraison: [C: 03+1] Update analytics data purge for webrequest_actor [puppet] - 10https://gerrit.wikimedia.org/r/887786 (https://phabricator.wikimedia.org/T324483) (owner: 10Joal)
[10:27:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43967 and previous config saved to /var/cache/conftool/dbconfig/20230209-102709-root.json
[10:27:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43968 and previous config saved to /var/cache/conftool/dbconfig/20230209-102713-root.json
[10:30:03] <wikibugs>	 (03CR) 10Clément Goubert: [C: 04-1] Add jaeger-es-index-cleaner (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway)
[10:31:01] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@2ab6564]: Analytics deploy for 3 druid jobs and webrequest_actor jobs
[10:31:19] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@2ab6564]: Analytics deploy for 3 druid jobs and webrequest_actor jobs (duration: 00m 17s)
[10:32:21] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Update analytics data purge for webrequest_actor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887786 (https://phabricator.wikimedia.org/T324483) (owner: 10Joal)
[10:34:03] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2052.codfw.wmnet with OS bullseye
[10:34:14] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] add_cuc_only_for_read_old_T329203.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887884 (https://phabricator.wikimedia.org/T329203) (owner: 10Marostegui)
[10:34:16] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] Adds 'before' directive to docker::network in gitlab runner setup [puppet] - 10https://gerrit.wikimedia.org/r/887843 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney)
[10:34:22] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] add_cuc_only_for_read_old_T329203.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887884 (https://phabricator.wikimedia.org/T329203) (owner: 10Marostegui)
[10:34:26] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS bullseye
[10:34:46] <wikibugs>	 (03Merged) 10jenkins-bot: add_cuc_only_for_read_old_T329203.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887884 (https://phabricator.wikimedia.org/T329203) (owner: 10Marostegui)
[10:35:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance
[10:35:56] <wikibugs>	 (03PS1) 10Majavah: apt::repository: use signed-by instead of apt-key [puppet] - 10https://gerrit.wikimedia.org/r/887943
[10:35:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance
[10:36:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T329203)', diff saved to https://phabricator.wikimedia.org/P43970 and previous config saved to /var/cache/conftool/dbconfig/20230209-103604-marostegui.json
[10:36:08] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[10:37:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P43971 and previous config saved to /var/cache/conftool/dbconfig/20230209-103712-marostegui.json
[10:37:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance
[10:37:17] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[10:37:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance
[10:37:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T328817)', diff saved to https://phabricator.wikimedia.org/P43972 and previous config saved to /var/cache/conftool/dbconfig/20230209-103733-marostegui.json
[10:38:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T329203)', diff saved to https://phabricator.wikimedia.org/P43973 and previous config saved to /var/cache/conftool/dbconfig/20230209-103819-marostegui.json
[10:38:21] <moritzm>	 !log installing containerd security updates
[10:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:42] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39478/console" [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah)
[10:42:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T328817)', diff saved to https://phabricator.wikimedia.org/P43974 and previous config saved to /var/cache/conftool/dbconfig/20230209-104208-marostegui.json
[10:42:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43975 and previous config saved to /var/cache/conftool/dbconfig/20230209-104214-root.json
[10:42:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43976 and previous config saved to /var/cache/conftool/dbconfig/20230209-104218-root.json
[10:47:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (10MoritzMuehlenhoff) I've reset the Netbox status from Failed to Active.
[10:48:03] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM, let me know how it goes" [puppet] - 10https://gerrit.wikimedia.org/r/887789 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott)
[10:48:53] <wikibugs>	 (03CR) 10Volans: "much better! replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[10:50:21] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2052.codfw.wmnet with reason: host reimage
[10:52:52] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2052.codfw.wmnet with reason: host reimage
[10:53:01] <wikibugs>	 10SRE: add Hal Triedman (htriedman) to ops-l mailing list - https://phabricator.wikimedia.org/T329209 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Thank you for reaching out @Htriedman ! Sign up is self-service here (list owners will need to approve the request) https://lists.wikimedia.org/postorius/lis...
[10:53:04] <wikibugs>	 (03CR) 10JMeybohm: Add a spark-operator chart and helmfile configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[10:53:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P43977 and previous config saved to /var/cache/conftool/dbconfig/20230209-105325-marostegui.json
[10:53:57] <wikibugs>	 (03PS1) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945
[10:55:35] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab-runner1003.eqiad.wmnet with OS bullseye
[10:57:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P43978 and previous config saved to /var/cache/conftool/dbconfig/20230209-105714-marostegui.json
[10:57:58] <wikibugs>	 (03PS5) 10FNegri: Add support for cloud test env (codfw) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797
[10:58:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: opensearch: reverse-proxy access to opensearch API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi)
[10:58:37] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@dff3f3b]: Fix analytics webrequest_actor_metrics_rollup sensor
[10:58:47] <wikibugs>	 (03CR) 10FNegri: Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri)
[10:58:51] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@dff3f3b]: Fix analytics webrequest_actor_metrics_rollup sensor (duration: 00m 13s)
[10:59:00] <wikibugs>	 (03PS1) 10Nicolas Fraison: fix(varnishkafka): add alert duration of 5m to avoid false positive [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522)
[10:59:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi)
[10:59:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on puppetdb2003.codfw.wmnet with reason: master is being reimaged
[10:59:42] <wikibugs>	 (03PS1) 10Marostegui: drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260)
[10:59:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on puppetdb2003.codfw.wmnet with reason: master is being reimaged
[10:59:54] <wikibugs>	 (03PS2) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945
[11:00:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b1f3dbef-467c-49de-8608-5ba564efbe81) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[11:00:05] <jouncebot>	 mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100).
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100)
[11:00:24] <wikibugs>	 (03CR) 10Ayounsi: [C: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi)
[11:01:36] <wikibugs>	 (03CR) 10Ladsgroup: drop_cuc_comment_T329260.py: New schema change (032 comments) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui)
[11:02:43] <wikibugs>	 (03PS2) 10Marostegui: drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260)
[11:02:48] <effie>	 !log powercycle mc-gp1001
[11:02:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:01] <wikibugs>	 (03CR) 10Marostegui: drop_cuc_comment_T329260.py: New schema change (032 comments) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui)
[11:04:08] <wikibugs>	 (03CR) 10Ladsgroup: drop_cuc_comment_T329260.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui)
[11:04:59] <wikibugs>	 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Traffic, and 2 others: Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10nfraison) From those graph we can see that no requests have been received on the varnish which leads to no...
[11:05:45] <wikibugs>	 (03CR) 10Ladsgroup: drop_cuc_comment_T329260.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui)
[11:05:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi)
[11:06:17] <wikibugs>	 (03PS3) 10Marostegui: drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260)
[11:06:40] <wikibugs>	 (03PS3) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945
[11:06:42] <wikibugs>	 (03CR) 10Marostegui: drop_cuc_comment_T329260.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui)
[11:07:25] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage
[11:08:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Reset puppetdb1003/2003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/887971 (https://phabricator.wikimedia.org/T321783)
[11:08:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P43979 and previous config saved to /var/cache/conftool/dbconfig/20230209-110832-marostegui.json
[11:08:36] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2052.codfw.wmnet with OS bullseye
[11:09:53] <wikibugs>	 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Traffic, and 2 others: Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10nfraison) The drop is indeed due to a depool  ` 09:09  <vgutierrez>  pool cp4044 with ESI testing enabled...
[11:10:33] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage
[11:11:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Reset puppetdb1003/2003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/887971 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff)
[11:11:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi)
[11:12:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P43980 and previous config saved to /var/cache/conftool/dbconfig/20230209-111220-marostegui.json
[11:13:46] <wikibugs>	 (03PS26) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767)
[11:14:13] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability, 10User-jbond: Netbox: use the netbox to upet sync to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond)
[11:14:29] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 2 others: Netbox: use the netbox to upet sync to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond) p:05Triage→03Medium
[11:15:09] <wikibugs>	 (03CR) 10JMeybohm: Add sre.k8s.upgrade-cluster (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:16:01] <wikibugs>	 (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:16:02] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability, and 2 others: Puppet: get data (row, rack, site, and other information) from Netbox - https://phabricator.wikimedia.org/T229397 (10jbond) 05Open→03Resolved This is now completed
[11:16:05] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond)
[11:16:21] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10taavi)
[11:16:30] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 2 others: Netbox: use the netbox to upet sync to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond) We should also see if we can use the same scripts/data to populate https://gerrit.wikimedia.org/r/c/opera...
[11:16:43] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 2 others: Netbox: use the netbox to  also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond)
[11:17:40] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10cloud-services-team: Puppet class systemd needs to throw a more useful error - https://phabricator.wikimedia.org/T195553 (10taavi) 05Open→03Invalid Boldly closing as we're fully on systemd.
[11:19:35] <wikibugs>	 (03CR) 10Volans: Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri)
[11:20:08] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc-gp1001.eqiad.wmnet with OS bullseye
[11:20:58] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS bullseye
[11:23:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T329203)', diff saved to https://phabricator.wikimedia.org/P43981 and previous config saved to /var/cache/conftool/dbconfig/20230209-112338-marostegui.json
[11:23:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance
[11:23:42] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[11:23:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance
[11:23:56] <wikibugs>	 (03PS27) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767)
[11:24:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T329203)', diff saved to https://phabricator.wikimedia.org/P43982 and previous config saved to /var/cache/conftool/dbconfig/20230209-112359-marostegui.json
[11:24:11] <wikibugs>	 (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:25:12] <wikibugs>	 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Aklapper) @Dzahn: Did you have any luck getting a reply?
[11:27:10] <wikibugs>	 10SRE, 10DNS, 10Infrastructure-Foundations: Reverse DNS missing for some hosts - https://phabricator.wikimedia.org/T251522 (10Aklapper) @Reedy: ping?
[11:27:11] <wikibugs>	 (03CR) 10Muehlenhoff: Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri)
[11:27:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T328817)', diff saved to https://phabricator.wikimedia.org/P43983 and previous config saved to /var/cache/conftool/dbconfig/20230209-112727-marostegui.json
[11:27:29] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[11:27:31] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[11:27:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri)
[11:27:42] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[11:27:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P43984 and previous config saved to /var/cache/conftool/dbconfig/20230209-112748-marostegui.json
[11:28:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff)
[11:29:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) All install servers are running Bullseye now, the only missing bit is to remove the old VMs.
[11:29:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T329203)', diff saved to https://phabricator.wikimedia.org/P43985 and previous config saved to /var/cache/conftool/dbconfig/20230209-112927-marostegui.json
[11:29:31] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[11:30:29] <wikibugs>	 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Sadads) You can close this ticket. It's been resolved.
[11:31:07] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1003.eqiad.wmnet with OS bullseye
[11:31:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P43986 and previous config saved to /var/cache/conftool/dbconfig/20230209-113125-marostegui.json
[11:32:04] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [WIP] Refactor and centralize BGPpeer config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi)
[11:33:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] jenkins: fix directory and restrict sudo rules to jenkins jars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto)
[11:34:21] <marostegui>	 !log Stop mariadb on db1098 (s6 and s7) T329171
[11:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:24] <stashbot>	 T329171: decommission db1098.eqiad.wmnet - https://phabricator.wikimedia.org/T329171
[11:38:43] <wikibugs>	 (03CR) 10Btullis: Add a spark-operator chart and helmfile configuration (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[11:39:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:40:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetdb1003.eqiad.wmnet with OS bullseye
[11:41:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bullseye
[11:42:09] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976
[11:42:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:44:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 (owner: 10Muehlenhoff)
[11:44:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:44:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P43988 and previous config saved to /var/cache/conftool/dbconfig/20230209-114434-marostegui.json
[11:45:43] <wikibugs>	 (03PS1) 10Aklapper: Remove redirect for pk.wikimedia.org (Pakistan) [puppet] - 10https://gerrit.wikimedia.org/r/887980 (https://phabricator.wikimedia.org/T328596)
[11:46:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P43989 and previous config saved to /var/cache/conftool/dbconfig/20230209-114632-marostegui.json
[11:47:05] <wikibugs>	 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10FRomeo_WMF) 05Open→03Resolved The Google Group was set up (thanks) and we just refreshed the membership and management.
[11:47:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:48:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:48:40] <wikibugs>	 (03PS1) 10JMeybohm: k8s::package: Ensure the apt component is registered first [puppet] - 10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943)
[11:49:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] k8s::package: Ensure the apt component is registered first [puppet] - 10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[11:50:56] <wikibugs>	 (03PS1) 10Volans: Add Makefile.deploy for the deploy cookbook [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/887982
[11:51:24] <wikibugs>	 (03PS1) 10EoghanGaffney: Try running docker before the base firewall rules are added [puppet] - 10https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035)
[11:51:26] <wikibugs>	 (03CR) 10Volans: "This can be compared with the one present in homer:" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/887982 (owner: 10Volans)
[11:52:01] <wikibugs>	 (03PS2) 10Ladsgroup: Migrate Babel config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887307 (https://phabricator.wikimedia.org/T308932)
[11:52:05] <logmsgbot>	 !log jiji@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc-gp1001.eqiad.wmnet with OS bullseye
[11:52:11] <Amir1>	 jouncebot: nowandnext
[11:52:11] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100)
[11:52:11] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100)
[11:52:11] <jouncebot>	 In 2 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1400)
[11:52:11] <jouncebot>	 In 2 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1400)
[11:52:29] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS bullseye
[11:52:35] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39479/console" [puppet] - 10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[11:53:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetdb1003.eqiad.wmnet with reason: host reimage
[11:53:34] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Migrate Babel config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887307 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup)
[11:54:32] <wikibugs>	 (03Merged) 10jenkins-bot: Migrate Babel config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887307 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup)
[11:55:05] <wikibugs>	 (03CR) 10Volans: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 (owner: 10Muehlenhoff)
[11:55:11] <wikibugs>	 10SRE-tools, 10Discovery-Search, 10Elasticsearch, 10Infrastructure-Foundations, 10Spicerack: elasticsearch spicerack module failes with most recent elastic-curator - https://phabricator.wikimedia.org/T328775 (10jbond) >>! In T328775#8599015, @bking wrote: > Thanks @jbond  ! Looking at the Spicerack chang...
[11:55:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetdb1003.eqiad.wmnet with reason: host reimage
[11:57:23] <logmsgbot>	 !log jiji@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc-gp1001.eqiad.wmnet with OS bullseye
[11:57:41] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS bullseye
[11:57:48] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri)
[11:58:08] <wikibugs>	 (03CR) 10FNegri: [V: 03+2 C: 03+2] Add support for cloud test env (codfw) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri)
[11:58:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:59:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P43990 and previous config saved to /var/cache/conftool/dbconfig/20230209-115940-marostegui.json
[12:01:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P43991 and previous config saved to /var/cache/conftool/dbconfig/20230209-120138-marostegui.json
[12:02:12] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on an-worker1098.eqiad.wmnet with reason: Attempting to move some GPUs
[12:02:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:02:26] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-worker1098.eqiad.wmnet with reason: Attempting to move some GPUs
[12:02:32] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on an-worker1099.eqiad.wmnet with reason: Attempting to move some GPUs
[12:02:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=49b5d5ab-a254-46d1-b90a-001be80f1...
[12:02:46] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-worker1099.eqiad.wmnet with reason: Attempting to move some GPUs
[12:02:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2de0632f-155c-4404-88de-ffa2c986c...
[12:03:08] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/ext-Babel.php: Move Babel settings from IS.php to ext-Babel.php, part I (T308932) (duration: 07m 06s)
[12:03:12] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[12:06:29] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: Attempting to move some GPUs
[12:06:43] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: Attempting to move some GPUs
[12:06:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ac0799f5-49e4-45fd-99e4-a3048068d...
[12:10:16] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Move Babel settings from IS.php to ext-Babel.php, part II (T308932) (duration: 06m 40s)
[12:10:19] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[12:10:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetdb1003.eqiad.wmnet with OS bullseye
[12:10:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bullseye completed: - puppetd...
[12:12:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetdb2003.codfw.wmnet with OS bullseye
[12:12:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetdb2003.codfw.wmnet with OS bullseye
[12:13:36] <wikibugs>	 (03PS2) 10Muehlenhoff: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976
[12:13:57] <wikibugs>	 (03CR) 10Muehlenhoff: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 (owner: 10Muehlenhoff)
[12:14:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T329203)', diff saved to https://phabricator.wikimedia.org/P43992 and previous config saved to /var/cache/conftool/dbconfig/20230209-121446-marostegui.json
[12:14:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance
[12:14:50] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[12:15:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance
[12:15:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T329203)', diff saved to https://phabricator.wikimedia.org/P43993 and previous config saved to /var/cache/conftool/dbconfig/20230209-121507-marostegui.json
[12:15:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 (owner: 10Muehlenhoff)
[12:16:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P43994 and previous config saved to /var/cache/conftool/dbconfig/20230209-121644-marostegui.json
[12:16:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance
[12:16:48] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[12:16:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance
[12:17:04] <logmsgbot>	 !log jiji@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc-gp1001.eqiad.wmnet with OS bullseye
[12:17:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T328817)', diff saved to https://phabricator.wikimedia.org/P43995 and previous config saved to /var/cache/conftool/dbconfig/20230209-121705-marostegui.json
[12:17:27] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Move Babel settings from IS.php to ext-Babel.php, part III (T308932) (duration: 06m 47s)
[12:17:30] <wikibugs>	 (03PS3) 10Muehlenhoff: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976
[12:17:30] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[12:18:22] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: reformat templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/887991 (https://phabricator.wikimedia.org/T329049)
[12:18:48] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS buster
[12:19:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T328817)', diff saved to https://phabricator.wikimedia.org/P43996 and previous config saved to /var/cache/conftool/dbconfig/20230209-121923-marostegui.json
[12:19:41] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:19:42] <wikibugs>	 (03CR) 10Jbond: "fly by comment" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri)
[12:19:48] <wikibugs>	 (03CR) 10Hnowlan: "recheck" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 (owner: 10Hokwelum)
[12:20:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T329203)', diff saved to https://phabricator.wikimedia.org/P43997 and previous config saved to /var/cache/conftool/dbconfig/20230209-122036-marostegui.json
[12:20:40] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[12:21:29] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:17] <logmsgbot>	 !log phedenskog@deploy1002 Started deploy [performance/navtiming@bb224a1]: (no justification provided)
[12:22:25] <logmsgbot>	 !log phedenskog@deploy1002 Finished deploy [performance/navtiming@bb224a1]: (no justification provided) (duration: 00m 08s)
[12:22:34] <wikibugs>	 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Traffic, and 2 others: Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10nfraison) I've looked back at the alerts we faced on the morning of the 7th, and those were due to a rol...
[12:23:31] <wikibugs>	 (03PS7) 10Clément Goubert: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774
[12:23:40] <wikibugs>	 (03CR) 10Jbond: Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri)
[12:27:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetdb2003.codfw.wmnet with reason: host reimage
[12:28:21] <wikibugs>	 (03PS1) 10Btullis: Update the kubectl config files generated for the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/887994 (https://phabricator.wikimedia.org/T322635)
[12:29:45] <wikibugs>	 (03PS2) 10JMeybohm: k8s::package: Ensure the apt component is registered first [puppet] - 10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943)
[12:30:23] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] k8s::package: Ensure the apt component is registered first [puppet] - 10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[12:31:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[12:31:21] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39480/console" [puppet] - 10https://gerrit.wikimedia.org/r/887994 (https://phabricator.wikimedia.org/T322635) (owner: 10Btullis)
[12:31:58] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1001.eqiad.wmnet with reason: host reimage
[12:32:14] <wikibugs>	 (03CR) 10Btullis: Update the kubectl config files generated for the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/887994 (https://phabricator.wikimedia.org/T322635) (owner: 10Btullis)
[12:32:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetdb2003.codfw.wmnet with reason: host reimage
[12:33:43] <wikibugs>	 (03CR) 10JMeybohm: "/cc Jesse - I think cfssl-issuer in aux is a copy-paste from dse? If so, it can/should be removed as well" [puppet] - 10https://gerrit.wikimedia.org/r/887994 (https://phabricator.wikimedia.org/T322635) (owner: 10Btullis)
[12:33:48] <wikibugs>	 (03PS3) 10Jbond: rotate-snmp: convert to cookbook classes and use secrets for passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/884996
[12:33:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] rotate-snmp: convert to cookbook classes and use secrets for passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond)
[12:34:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P43998 and previous config saved to /var/cache/conftool/dbconfig/20230209-123430-marostegui.json
[12:34:56] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1001.eqiad.wmnet with reason: host reimage
[12:35:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] rotate-snmp: convert to cookbook classes and use secrets for passwords (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond)
[12:35:36] <wikibugs>	 (03Merged) 10jenkins-bot: rotate-snmp: convert to cookbook classes and use secrets for passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond)
[12:35:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P43999 and previous config saved to /var/cache/conftool/dbconfig/20230209-123542-marostegui.json
[12:37:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:38:44] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui)
[12:39:17] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui)
[12:39:19] <wikibugs>	 (03PS1) 10Jbond: sre.pdus: correctly pass down doc string to arg parse method [cookbooks] - 10https://gerrit.wikimedia.org/r/887995
[12:39:43] <wikibugs>	 (03Merged) 10jenkins-bot: drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui)
[12:41:24] <wikibugs>	 (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/887995 (owner: 10Jbond)
[12:42:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:42:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/886857 (https://phabricator.wikimedia.org/T329195) (owner: 10Cwhite)
[12:46:13] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add sre.k8s.upgrade-cluster (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[12:46:14] <wikibugs>	 10SRE-tools, 10Discovery-Search, 10Elasticsearch, 10Infrastructure-Foundations, 10Spicerack: elasticsearch spicerack module failes with most recent elastic-curator - https://phabricator.wikimedia.org/T328775 (10Volans) @bking also keep in mind that for spicerack we use debian packages, so unless we do pa...
[12:46:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetdb2003.codfw.wmnet with OS bullseye
[12:46:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host puppetdb2003.codfw.wmnet with OS bullseye completed: - puppetd...
[12:47:25] <wikibugs>	 (03PS29) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[12:47:38] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "LGTM just one typo" [cookbooks] - 10https://gerrit.wikimedia.org/r/887995 (owner: 10Jbond)
[12:47:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[12:48:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[12:48:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[12:48:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44000 and previous config saved to /var/cache/conftool/dbconfig/20230209-124837-ladsgroup.json
[12:48:41] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[12:48:51] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@cf9d978]: Fix analytics pageview_actor_hourly
[12:49:04] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@cf9d978]: Fix analytics pageview_actor_hourly (duration: 00m 13s)
[12:49:26] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 (owner: 10Muehlenhoff)
[12:49:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P44001 and previous config saved to /var/cache/conftool/dbconfig/20230209-124936-marostegui.json
[12:50:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P44002 and previous config saved to /var/cache/conftool/dbconfig/20230209-125048-marostegui.json
[12:50:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:51:10] <wikibugs>	 (03CR) 10Btullis: fix(varnishkafka): add alert duration of 5m to avoid false positive (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison)
[12:52:47] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1001.eqiad.wmnet with OS buster
[12:56:40] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 04-1] jenkins: fix directory and restrict sudo rules to jenkins jars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto)
[12:58:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44003 and previous config saved to /var/cache/conftool/dbconfig/20230209-125803-ladsgroup.json
[12:58:07] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[13:02:00] <wikibugs>	 (03CR) 10Btullis: Remove the GPU configuration from an-worker109[67] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) (owner: 10Btullis)
[13:04:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] ci: move lists of contint and zuul hosts to hieradata/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn)
[13:04:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T328817)', diff saved to https://phabricator.wikimedia.org/P44004 and previous config saved to /var/cache/conftool/dbconfig/20230209-130442-marostegui.json
[13:04:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[13:04:47] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[13:04:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[13:05:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T328817)', diff saved to https://phabricator.wikimedia.org/P44005 and previous config saved to /var/cache/conftool/dbconfig/20230209-130504-marostegui.json
[13:05:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T329203)', diff saved to https://phabricator.wikimedia.org/P44006 and previous config saved to /var/cache/conftool/dbconfig/20230209-130555-marostegui.json
[13:05:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[13:05:58] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[13:06:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[13:08:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm, ping me to merge" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[13:09:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T328817)', diff saved to https://phabricator.wikimedia.org/P44007 and previous config saved to /var/cache/conftool/dbconfig/20230209-130901-marostegui.json
[13:09:39] <wikibugs>	 (03PS1) 10Mazevedo: Add iOS stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887998 (https://phabricator.wikimedia.org/T328697)
[13:09:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance
[13:10:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance
[13:10:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T329203)', diff saved to https://phabricator.wikimedia.org/P44008 and previous config saved to /var/cache/conftool/dbconfig/20230209-131010-marostegui.json
[13:10:19] <wikibugs>	 (03PS1) 10Muehlenhoff: cookbooks.sre.elasticsearch.restart-nginx: New cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/887999
[13:10:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add iOS stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887998 (https://phabricator.wikimedia.org/T328697) (owner: 10Mazevedo)
[13:10:40] <wikibugs>	 (03PS2) 10Nicolas Fraison: fix(varnishkafka): add alert duration of 5m to avoid false positive [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522)
[13:12:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:12:55] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison)
[13:13:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P44009 and previous config saved to /var/cache/conftool/dbconfig/20230209-131309-ladsgroup.json
[13:13:52] <wikibugs>	 (03PS2) 10Mazevedo: Add iOS stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887998 (https://phabricator.wikimedia.org/T328697)
[13:14:59] <moritzm>	 !log restarting Exim on MXes to pick up OpenSSL update
[13:15:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T329203)', diff saved to https://phabricator.wikimedia.org/P44010 and previous config saved to /var/cache/conftool/dbconfig/20230209-131540-marostegui.json
[13:15:44] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[13:16:25] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:16:44] <wikibugs>	 (03CR) 10Nicolas Fraison: fix(varnishkafka): add alert duration of 5m to avoid false positive (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison)
[13:19:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] phabricator: create phd home directory on service start [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[13:22:29] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:23:50] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10LSobanski) - GitLab failover requires a ~1.5h maintenance window during which GitLab will be unavailable. - We won'...
[13:24:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P44011 and previous config saved to /var/cache/conftool/dbconfig/20230209-132407-marostegui.json
[13:24:15] <wikibugs>	 (PS2) Jbond: sre.pdus: correctly pass down doc string to arg parse method [cookbooks] - https://gerrit.wikimedia.org/r/887995
[13:24:52] <wikibugs>	 (PS4) Muehlenhoff: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - https://gerrit.wikimedia.org/r/887976
[13:25:02] <wikibugs>	 (CR) Jbond: "fixed thanks" [cookbooks] - https://gerrit.wikimedia.org/r/887995 (owner: Jbond)
[13:25:09] <wikibugs>	 (CR) Volans: "question inline" [cookbooks] - https://gerrit.wikimedia.org/r/887999 (owner: Muehlenhoff)
[13:26:53] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/887995 (owner: Jbond)
[13:27:39] <hashar>	 !log phab2002: manually stopped `phd` service. It can't start because the MariaDB server is set read-only, and it has been failing to start every 10 seconds since forever
[13:27:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P44012 and previous config saved to /var/cache/conftool/dbconfig/20230209-132815-ladsgroup.json
[13:30:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P44013 and previous config saved to /var/cache/conftool/dbconfig/20230209-133046-marostegui.json
[13:32:12] <wikibugs>	 (CR) Ayounsi: [C: +1] "LGTM! But hard to mentally parse it all." [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/887982 (owner: Volans)
[13:33:56] <wikibugs>	 (CR) JMeybohm: Add a spark-operator chart and helmfile configuration (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[13:39:02] <wikibugs>	 (CR) Jaime Nuche: [C: -1] jenkins: fix directory and restrict sudo rules to jenkins jars (1 comment) [puppet] - https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: Jelto)
[13:39:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P44014 and previous config saved to /var/cache/conftool/dbconfig/20230209-133914-marostegui.json
[13:39:30] <wikibugs>	 (CR) Ottomata: "Ah I see still a little WIP?  Ping me when you'd like another review.  Looking good!" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[13:40:15] <elukey>	 !log restart prometheus-statsd-exporter on ores nodes to pick up label change - T325763
[13:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:18] <stashbot>	 T325763: Review ORES traffic to better understand Lift Wing's requirements - https://phabricator.wikimedia.org/T325763
[13:41:29] <wikibugs>	 (PS1) Slyngshede: SUL account linking [software/bitu] - https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807)
[13:43:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44016 and previous config saved to /var/cache/conftool/dbconfig/20230209-134322-ladsgroup.json
[13:43:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[13:43:26] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[13:43:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[13:43:39] <wikibugs>	 (PS2) Slyngshede: SUL account linking [software/bitu] - https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807)
[13:43:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44017 and previous config saved to /var/cache/conftool/dbconfig/20230209-134343-ladsgroup.json
[13:44:05] <wikibugs>	 (CR) Muehlenhoff: [C: +2] sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - https://gerrit.wikimedia.org/r/887976 (owner: Muehlenhoff)
[13:45:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P44018 and previous config saved to /var/cache/conftool/dbconfig/20230209-134553-marostegui.json
[13:47:11] <wikibugs>	 (PS3) Slyngshede: SUL account linking [software/bitu] - https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807)
[13:47:16] <wikibugs>	 (CR) Hashar: [C: -1] ci: move lists of contint and zuul hosts to hieradata/common.yaml (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[13:50:18] <wikibugs>	 (CR) Jbond: "lgtm see nits below and grab a +1 from moritz before we merge" [puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah)
[13:50:20] <wikibugs>	 (PS4) Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945
[13:50:31] <wikibugs>	 (PS2) Jbond: apt::repository: use signed-by instead of apt-key [puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah)
[13:51:57] <wikibugs>	 (PS3) Jbond: apt::repository: use signed-by instead of apt-key [puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah)
[13:53:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44019 and previous config saved to /var/cache/conftool/dbconfig/20230209-135309-ladsgroup.json
[13:53:13] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[13:53:27] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@fbebd61]: Update analytics actor dags spark resources
[13:53:41] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@fbebd61]: Update analytics actor dags spark resources (duration: 00m 13s)
[13:53:45] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39482/console" [puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah)
[13:54:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T328817)', diff saved to https://phabricator.wikimedia.org/P44020 and previous config saved to /var/cache/conftool/dbconfig/20230209-135420-marostegui.json
[13:54:22] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[13:54:25] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[13:54:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[13:54:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2429.codfw.wmnet with OS buster
[13:54:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44021 and previous config saved to /var/cache/conftool/dbconfig/20230209-135441-marostegui.json
[13:54:44] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2429.codfw.wmnet with OS buster
[13:55:22] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (owner: Ayounsi)
[13:57:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44022 and previous config saved to /var/cache/conftool/dbconfig/20230209-135741-marostegui.json
[13:58:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2430.codfw.wmnet with OS buster
[13:58:13] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2430.codfw.wmnet with OS buster
[13:58:41] <wikibugs>	 (CR) Filippo Giunchedi: "Idea LGTM, see inline re: warning/critical in title" [alerts] - https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: Nicolas Fraison)
[14:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1400)
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1400).
[14:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:32] <Lucas_WMDE>	 I’ll try to test the hacky fix I proposed at https://phabricator.wikimedia.org/T328634#8593132 later in the window
[14:01:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T329203)', diff saved to https://phabricator.wikimedia.org/P44023 and previous config saved to /var/cache/conftool/dbconfig/20230209-140059-marostegui.json
[14:01:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[14:01:03] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[14:01:11] <wikibugs>	 (CR) Jbond: [V: +1] "AFAICT this only affects the following classes and nothing in production" [puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah)
[14:01:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[14:01:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[14:01:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[14:01:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T329203)', diff saved to https://phabricator.wikimedia.org/P44024 and previous config saved to /var/cache/conftool/dbconfig/20230209-140124-marostegui.json
[14:02:51] <wikibugs>	 (PS4) Majavah: apt::repository: use signed-by instead of apt-key [puppet] - https://gerrit.wikimedia.org/r/887943
[14:02:54] <wikibugs>	 (CR) Jbond: [C: +1] ci: move lists of contint and zuul hosts to hieradata/common.yaml (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[14:03:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (BTullis)
[14:03:12] <wikibugs>	 (CR) CI reject: [V: -1] apt::repository: use signed-by instead of apt-key [puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah)
[14:03:27] <wikibugs>	 (CR) Herron: [C: +1] Upgrade plugins [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/886861 (https://phabricator.wikimedia.org/T317887) (owner: Cwhite)
[14:03:44] <wikibugs>	 (CR) Slyngshede: [C: +2] C:IDM Enable the group creating pipeline. [puppet] - https://gerrit.wikimedia.org/r/886331 (owner: Slyngshede)
[14:03:51] <wikibugs>	 (PS5) Majavah: apt::repository: use signed-by instead of apt-key [puppet] - https://gerrit.wikimedia.org/r/887943
[14:04:33] <wikibugs>	 (CR) Majavah: apt::repository: use signed-by instead of apt-key (3 comments) [puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah)
[14:06:02] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39483/console" [puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah)
[14:06:31] <wikibugs>	 SRE, Infrastructure-Foundations: Implement email address validation workflow - https://phabricator.wikimedia.org/T320808 (SLyngshede-WMF) p:Triage→Low
[14:06:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T329203)', diff saved to https://phabricator.wikimedia.org/P44025 and previous config saved to /var/cache/conftool/dbconfig/20230209-140650-marostegui.json
[14:06:54] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[14:07:31] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:07:39] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:08:12] <Lucas_WMDE>	 alright, I’ll do some testing on mwdebug1001, I hope nobody else is deploying there at the moment :)
[14:08:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P44026 and previous config saved to /var/cache/conftool/dbconfig/20230209-140815-ladsgroup.json
[14:08:36] <Lucas_WMDE>	 hrm
[14:09:02] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwdebug1001:~$ mwscript namespaceDupes.php shnwikibooks --fix | tee T328634-1-unpatched.out # T328634 – finished successfully, to my surprise
[14:09:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:06] <stashbot>	 T328634: Lost pages after deployed addtional namespaces on  shn.wikibooks - https://phabricator.wikimedia.org/T328634
[14:09:07] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:09:13] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:09:21] <wikibugs>	 (PS1) Jelto: prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points [puppet] - https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972)
[14:10:12] <wikibugs>	 (CR) Jbond: [C: +2] sre.pdus: correctly pass down doc string to arg parse method [cookbooks] - https://gerrit.wikimedia.org/r/887995 (owner: Jbond)
[14:10:16] <wikibugs>	 (PS3) Jbond: sre.pdus: correctly pass down doc string to arg parse method [cookbooks] - https://gerrit.wikimedia.org/r/887995
[14:11:17] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39484/console" [puppet] - https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) (owner: Jelto)
[14:12:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P44027 and previous config saved to /var/cache/conftool/dbconfig/20230209-141247-marostegui.json
[14:13:30] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm thanks" [puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah)
[14:14:00] <dcausse>	 !log T329089: re-playing detected inconsistencies (missing mediawiki.page-undelete events) from 2022-10-31 to 2023-02-07 to WDQS
[14:14:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:08] <stashbot>	 T329089: The rdf-streaming-updater does not reconcile missed page-undelete events - https://phabricator.wikimedia.org/T329089
[14:14:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2429.codfw.wmnet with reason: host reimage
[14:14:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2431.codfw.wmnet with OS buster
[14:14:48] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster
[14:15:00] <Lucas_WMDE>	 alright, I’m done with my testing (and didn’t even end up editing any files)
[14:16:07] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "lgtm, let's see if it works :)" [puppet] - https://gerrit.wikimedia.org/r/887872 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[14:16:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (Jclark-ctr)
[14:17:15] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2429.codfw.wmnet with reason: host reimage
[14:17:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (Jclark-ctr) removed gpu from an-worker1098, an-worker1099.  installed both gpu into dse-k8s-worker...
[14:19:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2430.codfw.wmnet with reason: host reimage
[14:21:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2432.codfw.wmnet with OS buster
[14:21:45] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2432.codfw.wmnet with OS buster
[14:21:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P44028 and previous config saved to /var/cache/conftool/dbconfig/20230209-142157-marostegui.json
[14:22:14] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2430.codfw.wmnet with reason: host reimage
[14:23:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P44029 and previous config saved to /var/cache/conftool/dbconfig/20230209-142321-ladsgroup.json
[14:24:55] <wikibugs>	 (CR) Muehlenhoff: cookbooks.sre.elasticsearch.restart-nginx: New cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/887999 (owner: Muehlenhoff)
[14:25:57] <logmsgbot>	 !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@dc3cd56]: T329089: proper reconciliation of missed page-undelete events
[14:25:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: dse-k8s-worker1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:26:02] <stashbot>	 T329089: The rdf-streaming-updater does not reconcile missed page-undelete events - https://phabricator.wikimedia.org/T329089
[14:26:27] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:27:01] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:27:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (BTullis) Open→Resolved Great! Thanks @Jclark-ctr both cards detected. ` btullis@dse-k8s-wo...
[14:27:30] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2053.codfw.wmnet with OS bullseye
[14:27:45] <wikibugs>	 (CR) Btullis: [C: +2] Update the kubectl config files generated for the dse-k8s cluster [puppet] - https://gerrit.wikimedia.org/r/887994 (https://phabricator.wikimedia.org/T322635) (owner: Btullis)
[14:27:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P44030 and previous config saved to /var/cache/conftool/dbconfig/20230209-142754-marostegui.json
[14:27:59] <wikibugs>	 (PS5) Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945
[14:29:57] <wikibugs>	 (CR) Andrew Bogott: [C: +2] puppet: adapt replica_cnf_api to python3.5 [puppet] - https://gerrit.wikimedia.org/r/887872 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[14:31:03] <wikibugs>	 (CR) FNegri: [V: +2 C: +2] Add support for cloud test env (codfw) (2 comments) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/887797 (owner: FNegri)
[14:31:27] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.703 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:31:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] "Verification happened on meet" [puppet] - https://gerrit.wikimedia.org/r/887937 (https://phabricator.wikimedia.org/T328787) (owner: Filippo Giunchedi)
[14:31:53] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:31:57] <wikibugs>	 (PS2) Btullis: Remove the GPU configuration from an-worker109[6-9] [puppet] - https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696)
[14:32:12] <wikibugs>	 (CR) Volans: [C: -1] "Nice! It's ready to be tested, just one inverted check to fix." [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[14:32:26] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[14:32:58] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (owner: Ayounsi)
[14:33:43] <wikibugs>	 (CR) Jbond: "see inline" [puppet] - https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035) (owner: EoghanGaffney)
[14:34:15] <wikibugs>	 (PS28) Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767)
[14:34:22] <wikibugs>	 (PS4) Raymond Ndibe: puppet: modify role::wmcs::nfs::primary for replica_cnf api [puppet] - https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663)
[14:34:55] <wikibugs>	 (CR) Elukey: Add sre.k8s.upgrade-cluster (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[14:35:15] <wikibugs>	 (CR) Volans: [C: +1] "NICE!!! Let's start testing it!" [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[14:35:45] <wikibugs>	 (PS2) EoghanGaffney: Try running docker before the base firewall rules are added [puppet] - https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035)
[14:36:08] <wikibugs>	 (CR) EoghanGaffney: Try running docker before the base firewall rules are added (1 comment) [puppet] - https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035) (owner: EoghanGaffney)
[14:37:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P44031 and previous config saved to /var/cache/conftool/dbconfig/20230209-143704-marostegui.json
[14:37:07] <wikibugs>	 (CR) Volans: Add support for cloud test env (codfw) (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/887797 (owner: FNegri)
[14:37:52] <wikibugs>	 (PS6) Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)
[14:38:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44032 and previous config saved to /var/cache/conftool/dbconfig/20230209-143828-ladsgroup.json
[14:38:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[14:38:32] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[14:38:41] <wikibugs>	 SRE, Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (SLyngshede-WMF)
[14:38:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[14:38:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[14:39:01] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Initial IDM puppetisation - https://phabricator.wikimedia.org/T320428 (SLyngshede-WMF) In progress→Resolved Initial work is done, but is to come down the line.
[14:39:21] <wikibugs>	 (CR) Volans: cookbooks.sre.elasticsearch.restart-nginx: New cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/887999 (owner: Muehlenhoff)
[14:39:24] <wikibugs>	 SRE, Infrastructure-Foundations: Create an IDM for Wikimedia developer accounts - https://phabricator.wikimedia.org/T319405 (SLyngshede-WMF)
[14:39:37] <wikibugs>	 SRE, Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF All sub-tasks are now closed.
[14:40:02] <wikibugs>	 (PS1) Mforns: analytics::refinery::job::druid_load.pp: Absent 3 jobs to migrate [puppet] - https://gerrit.wikimedia.org/r/888018 (https://phabricator.wikimedia.org/T328933)
[14:40:15] <wikibugs>	 SRE, Data-Persistence, Discovery-Search, serviceops, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (Clement_Goubert) >>! In T327920#8570661, @bd808 wrote: > #Toolhub does not have a working Kubernetes deployment outside of eqiad ({T28...
[14:41:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2432.codfw.wmnet with reason: host reimage
[14:42:06] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, thank you for the context in the commit message!" [puppet] - https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) (owner: Jelto)
[14:43:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44033 and previous config saved to /var/cache/conftool/dbconfig/20230209-144300-marostegui.json
[14:43:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[14:43:04] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[14:43:15] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[14:43:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[14:43:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T328817)', diff saved to https://phabricator.wikimedia.org/P44034 and previous config saved to /var/cache/conftool/dbconfig/20230209-144321-marostegui.json
[14:43:39] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2053.codfw.wmnet with reason: host reimage
[14:44:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2432.codfw.wmnet with reason: host reimage
[14:44:39] <wikibugs>	 (CR) Andrew Bogott: [C: +2] puppet: modify role::wmcs::nfs::primary for replica_cnf api [puppet] - https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe)
[14:44:44] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:44:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[14:44:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2429.codfw.wmnet with OS buster
[14:44:56] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2429.codfw.wmnet with OS buster completed: - mw2429 (**PASS**)   - Removed from Pupp...
[14:44:56] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[14:44:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2430.codfw.wmnet with OS buster
[14:44:59] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:45:04] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2430.codfw.wmnet with OS buster completed: - mw2430 (**PASS**)   - Removed from Pupp...
[14:45:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[14:45:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[14:45:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44035 and previous config saved to /var/cache/conftool/dbconfig/20230209-144535-ladsgroup.json
[14:45:39] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[14:46:18] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Request for SSH Access for kofori - https://phabricator.wikimedia.org/T328787 (fgiunchedi) Open→Resolved a:fgiunchedi @KOfori change is live and you should have full access in ~20 min. The bastions will be accessible already. See also https://wi...
[14:46:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2433.codfw.wmnet with OS buster
[14:46:45] <logmsgbot>	 !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@dc3cd56]: T329089: proper reconciliation of missed page-undelete events (duration: 20m 48s)
[14:46:47] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2433.codfw.wmnet with OS buster
[14:46:48] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2053.codfw.wmnet with reason: host reimage
[14:46:48] <stashbot>	 T329089: The rdf-streaming-updater does not reconcile missed page-undelete events - https://phabricator.wikimedia.org/T329089
[14:46:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2434.codfw.wmnet with OS buster
[14:47:04] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2434.codfw.wmnet with OS buster
[14:48:55] <wikibugs>	 (CR) Ayounsi: [WIP] Refactor and centralize BGPpeer config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[14:49:15] <wikibugs>	 (PS7) Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)
[14:49:56] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1001.eqiad.wmnet']
[14:50:51] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp1001.eqiad.wmnet']
[14:51:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1001.eqiad.wmnet']
[14:51:49] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp1001.eqiad.wmnet']
[14:52:00] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:52:03] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:52:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T329203)', diff saved to https://phabricator.wikimedia.org/P44036 and previous config saved to /var/cache/conftool/dbconfig/20230209-145210-marostegui.json
[14:52:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:52:14] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[14:52:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:52:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44037 and previous config saved to /var/cache/conftool/dbconfig/20230209-145232-marostegui.json
[14:52:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:52:39] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:54:30] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[14:55:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44038 and previous config saved to /var/cache/conftool/dbconfig/20230209-145513-ladsgroup.json
[14:55:17] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[14:55:42] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:56:11] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:56:33] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:57:01] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:57:22] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:58:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44039 and previous config saved to /var/cache/conftool/dbconfig/20230209-145811-marostegui.json
[14:58:15] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[14:58:34] <wikibugs>	 (CR) Andrew Bogott: [C: +2] "we got impatient :)" [puppet] - https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe)
[14:58:40] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[14:59:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:02:53] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Clement_Goubert) While not directly linked to the switchover as it does not have a codfw deployment, Toolhub will p...
[15:03:16] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2053.codfw.wmnet with OS bullseye
[15:03:42] <wikibugs>	 (PS1) Filippo Giunchedi: alertmanager: restore alert history feature on alerts.w.o [puppet] - https://gerrit.wikimedia.org/r/888027 (https://phabricator.wikimedia.org/T329294)
[15:04:44] <wikibugs>	 (CR) Jbond: Add support for cloud test env (codfw) (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/887797 (owner: FNegri)
[15:04:55] <wikibugs>	 (CR) Btullis: analytics::refinery::job::druid_load.pp: Absent 3 jobs to migrate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888018 (https://phabricator.wikimedia.org/T328933) (owner: Mforns)
[15:04:59] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[15:05:25] <wikibugs>	 (CR) Ayounsi: "Looking more into is, the Calico configuration knob `keepOriginalNextHop` implemented in https://github.com/projectcalico/libcalico-go/pul" [deployment-charts] - https://gerrit.wikimedia.org/r/886321 (https://phabricator.wikimedia.org/T328523) (owner: Alexandros Kosiaris)
[15:06:18] <wikibugs>	 (CR) Herron: [C: +1] "LGTM for v0" [puppet] - https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[15:06:21] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2434.codfw.wmnet with reason: host reimage
[15:07:04] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/887982 (owner: Volans)
[15:07:17] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2054.codfw.wmnet with OS bullseye
[15:08:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:08:29] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2432.codfw.wmnet with OS buster
[15:08:36] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2432.codfw.wmnet with OS buster completed: - mw2432 (**PASS**)   - Removed from Pupp...
[15:09:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2435.codfw.wmnet with OS buster
[15:09:10] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2435.codfw.wmnet with OS buster
[15:09:16] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm lets give it  shot" [puppet] - https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035) (owner: EoghanGaffney)
[15:09:30] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2434.codfw.wmnet with reason: host reimage
[15:09:35] <wikibugs>	 (CR) Herron: [C: +1] alertmanager: restore alert history feature on alerts.w.o [puppet] - https://gerrit.wikimedia.org/r/888027 (https://phabricator.wikimedia.org/T329294) (owner: Filippo Giunchedi)
[15:10:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P44040 and previous config saved to /var/cache/conftool/dbconfig/20230209-151019-ladsgroup.json
[15:10:49] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2431.codfw.wmnet with OS buster
[15:11:11] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster executed with errors: - mw2431 (**FAIL**)   - Remove...
[15:11:47] <wikibugs>	 (CR) Herron: [C: +2] "cheers thanks for the reviews!" [puppet] - https://gerrit.wikimedia.org/r/887804 (owner: Herron)
[15:12:14] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[15:13:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P44041 and previous config saved to /var/cache/conftool/dbconfig/20230209-151317-marostegui.json
[15:16:15] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[15:17:29] <icinga-wm>	 PROBLEM - Check systemd state on graphite2004 is CRITICAL: CRITICAL - degraded: The following units failed: statsd-proxy-socat-6to4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:32] <wikibugs>	 (CR) Nicolas Fraison: fix(varnishkafka): add alert duration of 5m to avoid false positive (1 comment) [alerts] - https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: Nicolas Fraison)
[15:22:12] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Santiago Faci - https://phabricator.wikimedia.org/T329296 (Sfaci)
[15:22:18] <wikibugs>	 (PS1) Hnowlan: Pin setuptools and packaging versions [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/888034 (https://phabricator.wikimedia.org/T329290)
[15:22:24] <wikibugs>	 Puppet, Infrastructure-Foundations: pupetmastrs: investigate if the puppetmasteres still need a checkout of operations/software - https://phabricator.wikimedia.org/T329297 (jbond) p:Triage→Medium
[15:23:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:23:28] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2054.codfw.wmnet with reason: host reimage
[15:23:32] <wikibugs>	 (CR) CI reject: [V: -1] Pin setuptools and packaging versions [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/888034 (https://phabricator.wikimedia.org/T329290) (owner: Hnowlan)
[15:23:34] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alertmanager: restore alert history feature on alerts.w.o [puppet] - https://gerrit.wikimedia.org/r/888027 (https://phabricator.wikimedia.org/T329294) (owner: Filippo Giunchedi)
[15:24:20] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-gp1001.eqiad.wmnet
[15:25:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P44042 and previous config saved to /var/cache/conftool/dbconfig/20230209-152525-ladsgroup.json
[15:25:54] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2054.codfw.wmnet with reason: host reimage
[15:28:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P44043 and previous config saved to /var/cache/conftool/dbconfig/20230209-152824-marostegui.json
[15:31:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: Nicolas Fraison)
[15:31:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:31:41] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2434.codfw.wmnet with OS buster
[15:31:49] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2434.codfw.wmnet with OS buster completed: - mw2434 (**PASS**)   - Removed from Pupp...
[15:34:18] <wikibugs>	 (PS2) Mforns: analytics::refinery::job::druid_load.pp: Absent 3 jobs to migrate [puppet] - https://gerrit.wikimedia.org/r/888018 (https://phabricator.wikimedia.org/T328933)
[15:34:34] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1001.eqiad.wmnet
[15:34:34] <logmsgbot>	 !log jiji@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts mc-gp1001.eqiad.wmnet
[15:34:35] <wikibugs>	 SRE, Wikimedia-Mailing-lists, User-MarcoAurelio: MM3/Postorius: Inconsistent translations for "Log In" in Spanish - https://phabricator.wikimedia.org/T312204 (MarcoAurelio) Open→Resolved a:MarcoAurelio This was fixed in [[ https://gitlab.com/mailman/django-mailman3/-/commit/31c6ae825fa055...
[15:38:25] <wikibugs>	 SRE, Wikimedia-Mailing-lists: MM3/Postorius: Inconsistent translations for "Log In" in Spanish - https://phabricator.wikimedia.org/T312204 (MarcoAurelio) a:MarcoAurelio→None
[15:39:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2431.codfw.wmnet with OS buster
[15:39:35] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS bullseye
[15:39:38] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster
[15:39:43] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2431.codfw.wmnet with OS buster
[15:39:49] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster executed with errors: - mw2431 (**FAIL**)   - Remove...
[15:40:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44044 and previous config saved to /var/cache/conftool/dbconfig/20230209-154032-ladsgroup.json
[15:40:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[15:40:35] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[15:40:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[15:40:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[15:40:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[15:40:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44045 and previous config saved to /var/cache/conftool/dbconfig/20230209-154058-ladsgroup.json
[15:41:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2431.codfw.wmnet with OS buster
[15:41:11] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster
[15:41:52] <wikibugs>	 (PS1) Vgutierrez: varnish: Perform ESI processing on wiki pages [puppet] - https://gerrit.wikimedia.org/r/888044 (https://phabricator.wikimedia.org/T308799)
[15:41:59] <wikibugs>	 (PS3) JHathaway: Add jaeger-es-index-cleaner [docker-images/production-images] - https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553)
[15:42:10] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2054.codfw.wmnet with OS bullseye
[15:42:49] <wikibugs>	 (CR) EoghanGaffney: [C: +2] Try running docker before the base firewall rules are added [puppet] - https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035) (owner: EoghanGaffney)
[15:42:58] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2433.codfw.wmnet with OS buster
[15:43:02] <wikibugs>	 (CR) JHathaway: Add jaeger-es-index-cleaner (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: JHathaway)
[15:43:06] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2433.codfw.wmnet with OS buster executed with errors: - mw2433 (**FAIL**)   - Remove...
[15:43:13] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (MarcoAurelio) Pardon my ignorance but are partial i18n updates possible (e.g. [[ https://gitlab.com/mailman/django-mailman3/-/tree/master/django_mailman3/lo...
[15:43:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44046 and previous config saved to /var/cache/conftool/dbconfig/20230209-154330-marostegui.json
[15:43:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[15:43:34] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[15:43:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[15:43:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T328817)', diff saved to https://phabricator.wikimedia.org/P44047 and previous config saved to /var/cache/conftool/dbconfig/20230209-154337-marostegui.json
[15:43:41] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[15:43:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44048 and previous config saved to /var/cache/conftool/dbconfig/20230209-154347-marostegui.json
[15:46:23] <wikibugs>	 (PS1) Filippo Giunchedi: admin: add Santiago Faci [puppet] - https://gerrit.wikimedia.org/r/888045 (https://phabricator.wikimedia.org/T329296)
[15:49:08] <wikibugs>	 (PS1) Herron: statsd_proxy: add ipv6only=1 to socat relay config [puppet] - https://gerrit.wikimedia.org/r/888046
[15:49:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44049 and previous config saved to /var/cache/conftool/dbconfig/20230209-154919-marostegui.json
[15:49:23] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[15:50:05] <wikibugs>	 (CR) Herron: [C: +2] statsd_proxy: add ipv6only=1 to socat relay config [puppet] - https://gerrit.wikimedia.org/r/888046 (owner: Herron)
[15:50:10] <wikibugs>	 (CR) Stevemunene: [C: +2] analytics::refinery::job::druid_load.pp: Absent 3 jobs to migrate [puppet] - https://gerrit.wikimedia.org/r/888018 (https://phabricator.wikimedia.org/T328933) (owner: Mforns)
[15:50:17] <wikibugs>	 (CR) Jelto: [V: +1] prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) (owner: Jelto)
[15:50:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44050 and previous config saved to /var/cache/conftool/dbconfig/20230209-155019-ladsgroup.json
[15:50:23] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[15:51:57] <wikibugs>	 SRE, MW-on-K8s, SRE Observability, serviceops: Ingest php-slowlog in logstash - https://phabricator.wikimedia.org/T326794 (Clement_Goubert) Dashboard available: https://logstash.wikimedia.org/app/dashboards#/view/74557260-a88f-11ed-96bb-4b4732aa077a
[15:52:10] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1001.eqiad.wmnet with reason: host reimage
[15:52:21] <wikibugs>	 (CR) Stevemunene: [C: +2] analytics::refinery::job::druid_load.pp: Absent 3 jobs to migrate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888018 (https://phabricator.wikimedia.org/T328933) (owner: Mforns)
[15:53:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) (owner: Jelto)
[15:54:39] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1001.eqiad.wmnet with reason: host reimage
[15:54:59] <wikibugs>	 (PS8) Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)
[15:55:17] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2055.codfw.wmnet with OS bullseye
[15:55:24] <sukhe>	 !log restart esitest.service on A:cp-text
[15:55:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:40] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1002.eqiad.wmnet
[15:55:47] <logmsgbot>	 !log jiji@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1002.eqiad.wmnet
[15:56:02] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1002.eqiad.wmnet
[15:56:03] <wikibugs>	 (CR) Ayounsi: "As we're adding 1.16 vs. 1.23 conditionals, if we merge that, we need to add the relevant cleanups to https://phabricator.wikimedia.org/T3" [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[15:56:35] <logmsgbot>	 !log jiji@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts mc-gp1002.eqiad.wmnet
[15:58:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P44051 and previous config saved to /var/cache/conftool/dbconfig/20230209-155843-marostegui.json
[16:02:07] <wikibugs>	 (CR) Clément Goubert: [C: +1] "Builds fine for me" [docker-images/production-images] - https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: JHathaway)
[16:02:40] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab-runner1004.eqiad.wmnet with OS bullseye
[16:04:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P44052 and previous config saved to /var/cache/conftool/dbconfig/20230209-160425-marostegui.json
[16:05:21] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2435.codfw.wmnet with OS buster
[16:05:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P44053 and previous config saved to /var/cache/conftool/dbconfig/20230209-160525-ladsgroup.json
[16:05:27] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2435.codfw.wmnet with OS buster executed with errors: - mw2435 (**FAIL**)   - Remove...
[16:06:13] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Can confirm! Builds for me" [docker-images/production-images] - https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: JHathaway)
[16:07:26] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/888045 (https://phabricator.wikimedia.org/T329296) (owner: Filippo Giunchedi)
[16:08:26] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] admin: add Santiago Faci [puppet] - https://gerrit.wikimedia.org/r/888045 (https://phabricator.wikimedia.org/T329296) (owner: Filippo Giunchedi)
[16:08:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:09:21] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc2055.codfw.wmnet with OS bullseye
[16:09:41] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2055.codfw.wmnet with OS bullseye
[16:10:24] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for Santiago Faci - https://phabricator.wikimedia.org/T329296 (fgiunchedi) Open→Resolved a:fgiunchedi @Sfaci you are now in the `wmf` LDAP group. I'm optimistically resolving the task, though feel free to reopen if some...
[16:10:30] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1001.eqiad.wmnet with OS bullseye
[16:13:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P44054 and previous config saved to /var/cache/conftool/dbconfig/20230209-161349-marostegui.json
[16:13:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:14:29] <icinga-wm>	 RECOVERY - Check systemd state on graphite2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:14:31] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage
[16:17:01] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage
[16:18:17] <wikibugs>	 (PS2) Vgutierrez: varnish: Perform ESI processing on wiki pages [puppet] - https://gerrit.wikimedia.org/r/888044 (https://phabricator.wikimedia.org/T308799)
[16:19:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P44055 and previous config saved to /var/cache/conftool/dbconfig/20230209-161931-marostegui.json
[16:20:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P44056 and previous config saved to /var/cache/conftool/dbconfig/20230209-162032-ladsgroup.json
[16:20:55] <wikibugs>	 (CR) BBlack: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/888044 (https://phabricator.wikimedia.org/T308799) (owner: Vgutierrez)
[16:25:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2055.codfw.wmnet with reason: host reimage
[16:25:51] <wikibugs>	 (CR) Volans: [V: +2 C: +2] Add Makefile.deploy for the deploy cookbook [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/887982 (owner: Volans)
[16:27:18] <wikibugs>	 (CR) Elukey: [C: +2] Add sre.k8s.upgrade-cluster [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[16:27:25] <wikibugs>	 (PS29) Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767)
[16:28:16] <wikibugs>	 (PS1) Ahmon Dancy: logspam.pl: Filter out some persistent noise [puppet] - https://gerrit.wikimedia.org/r/888050 (https://phabricator.wikimedia.org/T323254)
[16:28:23] <wikibugs>	 (CR) Hashar: jenkins: fix directory and restrict sudo rules to jenkins jars (1 comment) [puppet] - https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: Jelto)
[16:28:34] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2055.codfw.wmnet with reason: host reimage
[16:28:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T328817)', diff saved to https://phabricator.wikimedia.org/P44057 and previous config saved to /var/cache/conftool/dbconfig/20230209-162855-marostegui.json
[16:28:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[16:28:59] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[16:29:02] <wikibugs>	 (CR) Ahmon Dancy: "Tested on mwlog1002" [puppet] - https://gerrit.wikimedia.org/r/888050 (https://phabricator.wikimedia.org/T323254) (owner: Ahmon Dancy)
[16:29:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[16:29:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44058 and previous config saved to /var/cache/conftool/dbconfig/20230209-162927-marostegui.json
[16:31:16] <wikibugs>	 (PS1) Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - https://gerrit.wikimedia.org/r/888051
[16:31:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:31:30] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39486/console" [puppet] - https://gerrit.wikimedia.org/r/888044 (https://phabricator.wikimedia.org/T308799) (owner: Vgutierrez)
[16:32:57] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - https://gerrit.wikimedia.org/r/888051 (owner: Jbond)
[16:33:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44059 and previous config saved to /var/cache/conftool/dbconfig/20230209-163327-marostegui.json
[16:34:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44060 and previous config saved to /var/cache/conftool/dbconfig/20230209-163438-marostegui.json
[16:34:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance
[16:34:42] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[16:34:46] <wikibugs>	 (PS3) Andrea Denisse: centrallog: Add centrallog1001 to quickdatacopy allow hosts [puppet] - https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778)
[16:34:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance
[16:34:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T329203)', diff saved to https://phabricator.wikimedia.org/P44061 and previous config saved to /var/cache/conftool/dbconfig/20230209-163459-marostegui.json
[16:35:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44062 and previous config saved to /var/cache/conftool/dbconfig/20230209-163538-ladsgroup.json
[16:35:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[16:35:42] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[16:35:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[16:36:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44063 and previous config saved to /var/cache/conftool/dbconfig/20230209-163559-ladsgroup.json
[16:36:05] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39487/console" [puppet] - https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)
[16:36:33] <wikibugs>	 (CR) Brennen Bearnes: [C: +1] logspam.pl: Filter out some persistent noise [puppet] - https://gerrit.wikimedia.org/r/888050 (https://phabricator.wikimedia.org/T323254) (owner: Ahmon Dancy)
[16:36:34] <logmsgbot>	 !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@caf4808]: T329089: proper reconciliation of missed page-undelete events
[16:36:37] <stashbot>	 T329089: The rdf-streaming-updater does not reconcile missed page-undelete events - https://phabricator.wikimedia.org/T329089
[16:36:47] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/886861 (https://phabricator.wikimedia.org/T317887) (owner: Cwhite)
[16:37:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T329203)', diff saved to https://phabricator.wikimedia.org/P44064 and previous config saved to /var/cache/conftool/dbconfig/20230209-163720-marostegui.json
[16:37:22] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2431.codfw.wmnet with OS buster
[16:37:27] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster executed with errors: - mw2431 (**FAIL**)   - Remove...
[16:38:09] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] varnish: Perform ESI processing on wiki pages [puppet] - https://gerrit.wikimedia.org/r/888044 (https://phabricator.wikimedia.org/T308799) (owner: Vgutierrez)
[16:38:11] <wikibugs>	 (PS4) Andrea Denisse: centrallog: Enable auto_ferm_ipv6 to quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778)
[16:38:58] <logmsgbot>	 !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@caf4808]: T329089: proper reconciliation of missed page-undelete events (duration: 02m 24s)
[16:39:00] <wikibugs>	 (PS1) Muehlenhoff: Add safe.directory directives for the puppet master [puppet] - https://gerrit.wikimedia.org/r/888053
[16:39:30] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39488/console" [puppet] - https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)
[16:40:53] <wikibugs>	 (CR) Elukey: services: add the first lift wing stream to change-prop (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[16:41:40] <wikibugs>	 (CR) Andrea Denisse: "Hi, I enabled auto_ferm_ipv6 to open the firewall ports and sync the instances." [puppet] - https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)
[16:43:08] <wikibugs>	 (PS2) Muehlenhoff: cookbooks.sre.elasticsearch.restart-nginx: New cookbook [cookbooks] - https://gerrit.wikimedia.org/r/887999
[16:44:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) @Jclark-ctr can I get an update on the situation here / estimate of when we might be able to add the 4 links detailed above?  Ping...
[16:44:27] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2055.codfw.wmnet with OS bullseye
[16:44:52] <moritzm>	 !log installing curl security updates on buster
[16:44:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:06] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/888053 (owner: Muehlenhoff)
[16:45:10] <wikibugs>	 (CR) CI reject: [V: -1] cookbooks.sre.elasticsearch.restart-nginx: New cookbook [cookbooks] - https://gerrit.wikimedia.org/r/887999 (owner: Muehlenhoff)
[16:45:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44065 and previous config saved to /var/cache/conftool/dbconfig/20230209-164525-ladsgroup.json
[16:45:29] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[16:46:10] <wikibugs>	 (PS5) Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576)
[16:48:29] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1004.eqiad.wmnet with OS bullseye
[16:48:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P44066 and previous config saved to /var/cache/conftool/dbconfig/20230209-164834-marostegui.json
[16:51:16] <wikibugs>	 (PS3) Muehlenhoff: cookbooks.sre.elasticsearch.restart-nginx: New cookbook [cookbooks] - https://gerrit.wikimedia.org/r/887999
[16:52:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P44067 and previous config saved to /var/cache/conftool/dbconfig/20230209-165226-marostegui.json
[16:56:18] <jinxer-wm>	 (CertAlmostExpired) resolved: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:58:08] <icinga-wm>	 RECOVERY - Disk space on an-airflow1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-airflow1001&var-datasource=eqiad+prometheus/ops
[17:00:04] <jouncebot>	 jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P44068 and previous config saved to /var/cache/conftool/dbconfig/20230209-170031-ladsgroup.json
[17:01:50] <wikibugs>	 (CR) Jbond: "lgtm but may make more sense to go in puppetmaster::gitclone (which itself should probably be a profile but that's a different matter)" [puppet] - https://gerrit.wikimedia.org/r/888053 (owner: Muehlenhoff)
[17:02:24] <wikibugs>	 SRE, ops-eqiad, Observability-Metrics, Patch-For-Review, SRE Observability (FY2022/2023-Q2): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (wiki_willy) a:Jclark-ctr
[17:03:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P44069 and previous config saved to /var/cache/conftool/dbconfig/20230209-170340-marostegui.json
[17:06:11] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: <type of hardware failure> for <fqdn of server> - https://phabricator.wikimedia.org/T329305 (wiki_willy)
[17:06:38] <wikibugs>	 (PS1) EoghanGaffney: Insert an empty DOCKER-ISOLATION-STAGE-1 chain into the ferm templates [puppet] - https://gerrit.wikimedia.org/r/888057 (https://phabricator.wikimedia.org/T329035)
[17:06:51] <wikibugs>	 ops-eqiad, DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (wiki_willy)
[17:07:19] <moritzm>	 !log rolling restart of FPM/Apache on mw canaries to pick up curl security updates
[17:07:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P44070 and previous config saved to /var/cache/conftool/dbconfig/20230209-170732-marostegui.json
[17:08:46] <wikibugs>	 ops-eqiad, DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (wiki_willy)
[17:12:22] <wikibugs>	 SRE: add Hal Triedman (htriedman) to ops-l mailing list - https://phabricator.wikimedia.org/T329209 (Htriedman) @fgiunchedi I just signed up via lists.wikimedia.org! Thanks for getting back to me.
[17:14:08] <wikibugs>	 (PS2) Muehlenhoff: Add safe.directory directives for the puppet master [puppet] - https://gerrit.wikimedia.org/r/888053
[17:14:55] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm to add an empty chain." [puppet] - https://gerrit.wikimedia.org/r/888057 (https://phabricator.wikimedia.org/T329035) (owner: EoghanGaffney)
[17:15:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P44071 and previous config saved to /var/cache/conftool/dbconfig/20230209-171539-ladsgroup.json
[17:17:31] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/888053 (owner: Muehlenhoff)
[17:18:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44072 and previous config saved to /var/cache/conftool/dbconfig/20230209-171846-marostegui.json
[17:18:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[17:18:50] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[17:19:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[17:20:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[17:21:07] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[17:21:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:21:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:21:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T328817)', diff saved to https://phabricator.wikimedia.org/P44073 and previous config saved to /var/cache/conftool/dbconfig/20230209-172129-marostegui.json
[17:21:36] <wikibugs>	 (CR) Hnowlan: [C: +1] services: add the first lift wing stream to change-prop (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[17:22:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T329203)', diff saved to https://phabricator.wikimedia.org/P44074 and previous config saved to /var/cache/conftool/dbconfig/20230209-172239-marostegui.json
[17:22:43] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[17:25:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T328817)', diff saved to https://phabricator.wikimedia.org/P44075 and previous config saved to /var/cache/conftool/dbconfig/20230209-172524-marostegui.json
[17:25:28] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[17:30:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44076 and previous config saved to /var/cache/conftool/dbconfig/20230209-173045-ladsgroup.json
[17:30:48] <wikibugs>	 (PS1) BryanDavis: developer-portal: Bump container to 2023-02-06-121917-production [deployment-charts] - https://gerrit.wikimedia.org/r/888061
[17:30:49] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[17:31:36] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (MoritzMuehlenhoff)
[17:32:07] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@e84e692]: (no justification provided)
[17:32:24] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@e84e692]: (no justification provided) (duration: 00m 16s)
[17:33:53] <wikibugs>	 (Abandoned) Andrea Denisse: centrallog: Enable auto_ferm_ipv6 to quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)
[17:40:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P44077 and previous config saved to /var/cache/conftool/dbconfig/20230209-174030-marostegui.json
[17:41:05] <wikibugs>	 (CR) BryanDavis: [C: +2] developer-portal: Bump container to 2023-02-06-121917-production [deployment-charts] - https://gerrit.wikimedia.org/r/888061 (owner: BryanDavis)
[17:41:06] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp2001.codfw.wmnet
[17:41:44] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T328733 (AKhatun_WMF) Thank you, accessed!
[17:43:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2431.codfw.wmnet with OS buster
[17:44:05] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster
[17:46:05] <wikibugs>	 (Merged) jenkins-bot: developer-portal: Bump container to 2023-02-06-121917-production [deployment-charts] - https://gerrit.wikimedia.org/r/888061 (owner: BryanDavis)
[17:49:09] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T328420 (wiki_willy) a:Papaul
[17:50:41] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mc-gp2001.codfw.wmnet
[17:51:08] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops-radar: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 (wiki_willy) a:Papaul
[17:51:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2433.codfw.wmnet with OS buster
[17:51:31] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2433.codfw.wmnet with OS buster
[17:55:04] <wikibugs>	 (PS1) Herron: rsync: remove rsync::server::wrap_with_stunnel [puppet] - https://gerrit.wikimedia.org/r/888065
[17:55:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P44078 and previous config saved to /var/cache/conftool/dbconfig/20230209-175536-marostegui.json
[17:57:11] <wikibugs>	 (PS1) MusikAnimal: InitialiseSettings: install PageAssessments on newiki [mediawiki-config] - https://gerrit.wikimedia.org/r/888066 (https://phabricator.wikimedia.org/T328224)
[18:00:05] <jouncebot>	 bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1800).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1800)
[18:00:48] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp2001.codfw.wmnet
[18:00:52] <logmsgbot>	 !log jiji@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp2001.codfw.wmnet
[18:01:04] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp2001.codfw.wmnet
[18:01:41] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[18:02:05] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[18:02:05] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet
[18:02:20] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:03:05] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[18:03:12] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[18:03:55] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[18:04:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2431.codfw.wmnet with reason: host reimage
[18:05:20] <wikibugs>	 (CR) Herron: "Came across these inline lookups while prepping rsync transfers between centrallog hosts.  Proposing we simply get rid of them since permi" [puppet] - https://gerrit.wikimedia.org/r/888065 (owner: Herron)
[18:07:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2431.codfw.wmnet with reason: host reimage
[18:08:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2435.codfw.wmnet with OS buster
[18:08:59] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2435.codfw.wmnet with OS buster
[18:09:15] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset
[18:09:36] <logmsgbot>	 !log jiji@cumin1001 Updating IPMI password on 1 hosts - jiji@cumin1001
[18:09:37] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0)
[18:10:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T328817)', diff saved to https://phabricator.wikimedia.org/P44079 and previous config saved to /var/cache/conftool/dbconfig/20230209-181043-marostegui.json
[18:10:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[18:10:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2433.codfw.wmnet with reason: host reimage
[18:10:47] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[18:11:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[18:11:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T328817)', diff saved to https://phabricator.wikimedia.org/P44080 and previous config saved to /var/cache/conftool/dbconfig/20230209-181115-marostegui.json
[18:11:59] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet
[18:12:00] <logmsgbot>	 !log jiji@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts mc-gp2001.codfw.wmnet
[18:13:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T328817)', diff saved to https://phabricator.wikimedia.org/P44081 and previous config saved to /var/cache/conftool/dbconfig/20230209-181353-marostegui.json
[18:13:54] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2433.codfw.wmnet with reason: host reimage
[18:20:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:22:35] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp2001.codfw.wmnet with OS bullseye
[18:26:17] <wikibugs>	 (CR) FNegri: [V: +2 C: +2] Add support for cloud test env (codfw) (2 comments) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/887797 (owner: FNegri)
[18:28:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2435.codfw.wmnet with reason: host reimage
[18:28:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P44082 and previous config saved to /var/cache/conftool/dbconfig/20230209-182859-marostegui.json
[18:30:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:32:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2435.codfw.wmnet with reason: host reimage
[18:32:54] <wikibugs>	 ops-codfw, ops-eqiad, DC-Ops, serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (jijiki)
[18:32:58] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:32:59] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2433.codfw.wmnet with OS buster
[18:33:01] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:33:02] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2431.codfw.wmnet with OS buster
[18:33:05] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2433.codfw.wmnet with OS buster completed: - mw2433 (**PASS**)   - Removed from Pupp...
[18:33:09] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster completed: - mw2431 (**PASS**)   - Removed from Pupp...
[18:34:46] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[18:36:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[18:36:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[18:36:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44083 and previous config saved to /var/cache/conftool/dbconfig/20230209-183611-ladsgroup.json
[18:36:15] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[18:38:03] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp2001.codfw.wmnet with reason: host reimage
[18:40:37] <wikibugs>	 (CR) Jbond: "thanks for the follow ups 😊" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/887797 (owner: FNegri)
[18:41:02] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp2001.codfw.wmnet with reason: host reimage
[18:42:58] <wikibugs>	 SRE, LDAP-Access-Requests: Add Jon Amar WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T329324 (jon_amar-WMDE)
[18:44:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P44084 and previous config saved to /var/cache/conftool/dbconfig/20230209-184405-marostegui.json
[18:44:21] <wikibugs>	 (03CR) 10FNegri: [V: 03+2 C: 03+2] Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri)
[18:45:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44085 and previous config saved to /var/cache/conftool/dbconfig/20230209-184538-ladsgroup.json
[18:45:42] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[18:48:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:49:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:49:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2435.codfw.wmnet with OS buster
[18:49:33] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2435.codfw.wmnet with OS buster completed: - mw2435 (**PASS**)   - Removed from Pupp...
[18:55:58] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp2001.codfw.wmnet with OS bullseye
[18:56:05] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul)
[18:56:55] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (10Reedy)
[18:59:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T328817)', diff saved to https://phabricator.wikimedia.org/P44086 and previous config saved to /var/cache/conftool/dbconfig/20230209-185912-marostegui.json
[18:59:13] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[18:59:16] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[18:59:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[18:59:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T328817)', diff saved to https://phabricator.wikimedia.org/P44087 and previous config saved to /var/cache/conftool/dbconfig/20230209-185933-marostegui.json
[19:00:04] <jouncebot>	 ^demon and dancy: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1900).
[19:00:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P44088 and previous config saved to /var/cache/conftool/dbconfig/20230209-190044-ladsgroup.json
[19:01:12] <ebernhardson>	 !log start full-cluster in-place reindexing of all wiki elasticsearch clusters T147505
[19:01:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:15] <stashbot>	 T147505: [tracking] CirrusSearch: what is updated during re-indexing - https://phabricator.wikimedia.org/T147505
[19:02:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T328817)', diff saved to https://phabricator.wikimedia.org/P44089 and previous config saved to /var/cache/conftool/dbconfig/20230209-190211-marostegui.json
[19:15:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P44090 and previous config saved to /var/cache/conftool/dbconfig/20230209-191551-ladsgroup.json
[19:17:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P44091 and previous config saved to /var/cache/conftool/dbconfig/20230209-191717-marostegui.json
[19:18:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr)  @cmooney sorry for delay finished connecting links and updated cableid's
[19:30:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44092 and previous config saved to /var/cache/conftool/dbconfig/20230209-193057-ladsgroup.json
[19:30:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[19:31:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[19:31:02] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[19:31:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44093 and previous config saved to /var/cache/conftool/dbconfig/20230209-193107-ladsgroup.json
[19:32:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P44094 and previous config saved to /var/cache/conftool/dbconfig/20230209-193223-marostegui.json
[19:36:16] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Jhancock.wm)
[19:40:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44095 and previous config saved to /var/cache/conftool/dbconfig/20230209-194032-ladsgroup.json
[19:40:37] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[19:41:31] <wikibugs>	 (03PS4) 10Bking: elastic: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335)
[19:42:30] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking)
[19:46:48] <wikibugs>	 (03PS5) 10Bking: elastic relforge: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335)
[19:47:04] <wikibugs>	 (03PS6) 10Ryan Kemper: elastic relforge: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking)
[19:47:09] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] elastic relforge: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking)
[19:47:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T328817)', diff saved to https://phabricator.wikimedia.org/P44096 and previous config saved to /var/cache/conftool/dbconfig/20230209-194730-marostegui.json
[19:47:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:47:34] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[19:47:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:47:47] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic relforge: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking)
[19:55:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P44097 and previous config saved to /var/cache/conftool/dbconfig/20230209-195539-ladsgroup.json
[19:57:00] <wikibugs>	 (03PS1) 10Bking: elastic relforge: update logstash transport [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335)
[20:03:55] <wikibugs>	 (03PS2) 10Bking: elastic relforge: update logstash transport [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335)
[20:05:08] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking)
[20:05:51] <wikibugs>	 (03PS3) 10Bking: elastic relforge: update logstash transport [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335)
[20:07:29] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking)
[20:10:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P44098 and previous config saved to /var/cache/conftool/dbconfig/20230209-201045-ladsgroup.json
[20:12:11] <wikibugs>	 (03CR) 10Bking: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39490/console" [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking)
[20:17:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-cephleaks.py: add 'delete' functionality [puppet] - 10https://gerrit.wikimedia.org/r/887789 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott)
[20:19:50] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add Jon Amar WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T329324 (10WMDE-leszek) I endorse the request on WMDE's behalf, and confirm the identity of @jon_amar-WMDE.
[20:20:40] <wikibugs>	 (03PS1) 10Mforns: analytics::refinery::job::druid_load.pp: remove absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/888082 (https://phabricator.wikimedia.org/T328933)
[20:22:06] <wikibugs>	 (03PS2) 10Mforns: analytics::refinery::job::druid_load.pp: remove absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/888082 (https://phabricator.wikimedia.org/T328933)
[20:22:18] <wikibugs>	 (03PS3) 10Mforns: analytics::refinery::job::druid_load.pp: remove absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/888082 (https://phabricator.wikimedia.org/T328933)
[20:23:15] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] "PCC looks as expected" [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking)
[20:23:25] <wikibugs>	 (03CR) 10Bking: [V: 03+1 C: 03+2] elastic relforge: update logstash transport [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking)
[20:25:42] <wikibugs>	 (03PS4) 10Mforns: analytics::refinery::job::druid_load.pp: remove absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/888082 (https://phabricator.wikimedia.org/T328933)
[20:25:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44099 and previous config saved to /var/cache/conftool/dbconfig/20230209-202551-ladsgroup.json
[20:25:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[20:25:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[20:25:57] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[20:27:14] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335
[20:27:18] <stashbot>	 T324335: Remove logstash from the Search Elasticsearch servers - https://phabricator.wikimedia.org/T324335
[20:31:01] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335
[20:32:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[20:32:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[20:32:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44100 and previous config saved to /var/cache/conftool/dbconfig/20230209-203236-ladsgroup.json
[20:32:40] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[20:34:08] <wikibugs>	 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-extensions-Phonos, 10Wikimedia-production-error: Steady rate of Phonos Swift errors (inc. DescribeFileOp failed, FileBackendStore::ingestFreshFileStats: Could not stat) - https://phabricator.wikimedia.org/T329249 (10Aklapper)
[20:42:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44101 and previous config saved to /var/cache/conftool/dbconfig/20230209-204214-ladsgroup.json
[20:42:18] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[20:47:01] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335
[20:47:04] <stashbot>	 T324335: Remove logstash from the Search Elasticsearch servers - https://phabricator.wikimedia.org/T324335
[20:50:52] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335
[20:57:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P44102 and previous config saved to /var/cache/conftool/dbconfig/20230209-205720-ladsgroup.json
[21:00:04] <jouncebot>	 brennen and TheresNoTime: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T2100).
[21:00:04] <jouncebot>	 musikanimal: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:01:40] <thcipriani>	 musikanimal: around for backport? I can deploy
[21:02:07] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] slo_dashboards: dynamic slo dashboard panels (032 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron)
[21:02:10] <musikanimal>	 o/
[21:02:17] <thcipriani>	 cool :)
[21:05:15] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs-novastats-cephleaks.py: remove a broken (and unneeded) output check. [puppet] - 10https://gerrit.wikimedia.org/r/888087 (https://phabricator.wikimedia.org/T289623)
[21:07:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888066 (https://phabricator.wikimedia.org/T328224) (owner: 10MusikAnimal)
[21:08:55] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings: install PageAssessments on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888066 (https://phabricator.wikimedia.org/T328224) (owner: 10MusikAnimal)
[21:09:19] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:888066|InitialiseSettings: install PageAssessments on newiki (T328224)]]
[21:09:26] <stashbot>	 T328224: Deploy PageAssessments to Nepali Wikipedia - https://phabricator.wikimedia.org/T328224
[21:11:11] <logmsgbot>	 !log thcipriani@deploy1002 musikanimal and thcipriani: Backport for [[gerrit:888066|InitialiseSettings: install PageAssessments on newiki (T328224)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[21:11:28] <thcipriani>	 ^ musikanimal your patch is on mwdebug, check please :)
[21:11:34] <musikanimal>	 checking!
[21:12:11] <musikanimal>	 db error. Did you run update.php? (sorry I don't know how this works for deployers)
[21:12:21] <musikanimal>	 I guess I should have said that beforehand, sorry
[21:12:23] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-cephleaks.py: remove a broken (and unneeded) output check. [puppet] - 10https://gerrit.wikimedia.org/r/888087 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott)
[21:12:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P44103 and previous config saved to /var/cache/conftool/dbconfig/20230209-211226-ladsgroup.json
[21:12:53] <musikanimal>	 `mwscript extensions/WikimediaMaintenance/createExtensionTables.php newiki pageassessments`
[21:12:57] <musikanimal>	 on mwmaint1002
[21:13:02] <thcipriani>	 oof, musikanimal no, sorry, we don't run update.php as part of deploy. Usually folks sync up with the dba beforehand to do that.
[21:13:13] <musikanimal>	 bah
[21:13:17] <thcipriani>	 :(
[21:13:18] <musikanimal>	 okay, this can wait if it needs to
[21:13:27] <thcipriani>	 I would prefer that
[21:13:32] <musikanimal>	 okay no problem :)
[21:13:39] <thcipriani>	 thanks and sorry, reverting
[21:13:44] <logmsgbot>	 !log thcipriani@deploy1002 sync-world aborted: Backport for [[gerrit:888066|InitialiseSettings: install PageAssessments on newiki (T328224)]] (duration: 04m 24s)
[21:13:44] <logmsgbot>	 !log thcipriani@deploy1002 backport aborted:  (duration: 06m 05s)
[21:13:53] <musikanimal>	 my fault! I should read the docs or something
[21:14:27] <wikibugs>	 (03PS1) 10TrainBranchBot: Revert "InitialiseSettings: install PageAssessments on newiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888089
[21:14:29] <wikibugs>	 (03CR) 10TrainBranchBot: "thcipriani@deploy1002 created a revert of this change as I0972d873ffaa4106a0bec64e758729e243bf8896" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888066 (https://phabricator.wikimedia.org/T328224) (owner: 10MusikAnimal)
[21:15:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888089 (owner: 10TrainBranchBot)
[21:16:39] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "InitialiseSettings: install PageAssessments on newiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888089 (owner: 10TrainBranchBot)
[21:17:02] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (10jijiki)
[21:17:04] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:888089|Revert "InitialiseSettings: install PageAssessments on newiki"]]
[21:18:56] <logmsgbot>	 !log thcipriani@deploy1002 trainbranchbot and thcipriani: Backport for [[gerrit:888089|Revert "InitialiseSettings: install PageAssessments on newiki"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[21:19:32] <thcipriani>	 alright, should be all reset on the mwdebug servers
[21:19:37] <logmsgbot>	 !log thcipriani@deploy1002 Sync cancelled.
[21:27:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44104 and previous config saved to /var/cache/conftool/dbconfig/20230209-212732-ladsgroup.json
[21:27:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[21:27:37] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[21:27:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[21:27:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[21:27:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[21:27:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44105 and previous config saved to /var/cache/conftool/dbconfig/20230209-212747-ladsgroup.json
[21:29:32] <wikibugs>	 10SRE, 10Traffic: create a puppetized abstraction for haproxy blocklist hysteresis - https://phabricator.wikimedia.org/T329331 (10CDanis)
[21:36:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44106 and previous config saved to /var/cache/conftool/dbconfig/20230209-213607-ladsgroup.json
[21:36:11] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[21:47:24] <wikibugs>	 10SRE, 10Traffic, 10Data Pipelines (Sprint 08): Document Impact of Jan 8&9 Traffic Data Loss - https://phabricator.wikimedia.org/T326658 (10Snwachukwu) Here is a google [[ https://docs.google.com/document/d/1rz7L24EVECOKYGhn-GTUIGo3NmbCIGCQW9XXLNrxCEM/edit# | doc ]] containing a draft
[21:51:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P44107 and previous config saved to /var/cache/conftool/dbconfig/20230209-215114-ladsgroup.json
[22:00:58] <wikibugs>	 (03PS1) 10Zabe: Start reading from rev_comment_id in cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888093 (https://phabricator.wikimedia.org/T275246)
[22:06:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P44108 and previous config saved to /var/cache/conftool/dbconfig/20230209-220620-ladsgroup.json
[22:15:12] <wikibugs>	 (03Abandoned) 10Cwhite: logstash: migrate mediawiki_ecs to ecs 1.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/831952 (https://phabricator.wikimedia.org/T314098) (owner: 10Cwhite)
[22:16:06] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: enable error.stack.previous_trace [puppet] - 10https://gerrit.wikimedia.org/r/886863 (https://phabricator.wikimedia.org/T314098) (owner: 10Cwhite)
[22:21:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44109 and previous config saved to /var/cache/conftool/dbconfig/20230209-222126-ladsgroup.json
[22:21:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[22:21:31] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[22:21:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[22:21:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44110 and previous config saved to /var/cache/conftool/dbconfig/20230209-222137-ladsgroup.json
[22:25:53] <icinga-wm>	 PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 1893 MB (3% inode=97%): /srv/swift-storage/sda3 10727 MB (5% inode=99%): /tmp 1893 MB (3% inode=97%): /var/tmp 1893 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[22:26:13] <wikibugs>	 (03PS1) 10Volans: debmonitorgc: garbage collect also stale Hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/888095
[22:30:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44111 and previous config saved to /var/cache/conftool/dbconfig/20230209-223003-ladsgroup.json
[22:30:07] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[22:36:13] <wikibugs>	 10SRE, 10Data-Persistence, 10Discovery-Search, 10serviceops, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10bd808) >>! In T329193#8601521, @Clement_Goubert wrote: >>>! In T327920#8570661, @bd808 wrote: >> #Toolhub does not have a working Kube...
[22:40:32] <zabe>	 jouncebot, nowandnext
[22:40:32] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 19 minute(s)
[22:40:32] <jouncebot>	 In 8 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230210T0700)
[22:40:50] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Start reading from rev_comment_id in cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888093 (https://phabricator.wikimedia.org/T275246) (owner: 10Zabe)
[22:41:56] <wikibugs>	 (03Merged) 10jenkins-bot: Start reading from rev_comment_id in cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888093 (https://phabricator.wikimedia.org/T275246) (owner: 10Zabe)
[22:42:28] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:888093|Start reading from rev_comment_id in cebwiki (T275246)]]
[22:42:31] <stashbot>	 T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[22:44:20] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:888093|Start reading from rev_comment_id in cebwiki (T275246)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[22:45:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P44112 and previous config saved to /var/cache/conftool/dbconfig/20230209-224509-ladsgroup.json
[22:50:55] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:888093|Start reading from rev_comment_id in cebwiki (T275246)]] (duration: 08m 26s)
[22:50:59] <stashbot>	 T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[23:00:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P44113 and previous config saved to /var/cache/conftool/dbconfig/20230209-230016-ladsgroup.json
[23:03:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:08:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:10:37] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mw2332 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[23:10:37] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mw2329 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[23:14:29] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 (10Jhancock.wm) investigated each server individually.   mw2329 had a bad cord. replaced.     The input pow...
[23:14:35] <wikibugs>	 10SRE, 10Traffic: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10tstarling) Regarding the concern that malicious user input could lead to injection of ESI tags:  * In the old parser:   * HTML comments in user input are completely removed   * Angle brackets...
[23:15:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44114 and previous config saved to /var/cache/conftool/dbconfig/20230209-231522-ladsgroup.json
[23:15:26] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[23:26:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:31:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:32:55] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 (10Papaul) @Jhancock.wm thank you. You can resolve the task
[23:34:37] <wikibugs>	 10SRE, 10SRE-OnFire, 10ops-codfw, 10Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10Papaul)  Your shipment  1ZA19A020397868137    Delivered On  Thursday, February 09 at 3:41 P.M. at Dock Delivered To  LAREDO, TX US Received By:  ESQUIVEL  Proof of Delivery
[23:36:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:37:10] <jinxer-wm>	 (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:38:12] <wikibugs>	 (PS1) Ladsgroup: Revert "Start reading from rev_comment_id in cebwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/887858
[23:38:14] <wikibugs>	 (PS1) Zabe: Revert "Start reading from rev_comment_id in cebwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/887859 (https://phabricator.wikimedia.org/T275246)
[23:38:18] <wikibugs>	 (CR) Ladsgroup: [C: +2] Revert "Start reading from rev_comment_id in cebwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/887858 (owner: Ladsgroup)
[23:38:32] <wikibugs>	 (Abandoned) Zabe: Revert "Start reading from rev_comment_id in cebwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/887859 (https://phabricator.wikimedia.org/T275246) (owner: Zabe)
[23:39:14] <wikibugs>	 (Merged) jenkins-bot: Revert "Start reading from rev_comment_id in cebwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/887858 (owner: Ladsgroup)
[23:39:59] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:887858|Revert "Start reading from rev_comment_id in cebwiki"]]
[23:41:51] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:887858|Revert "Start reading from rev_comment_id in cebwiki"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[23:41:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:42:10] <jinxer-wm>	 (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:46:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:49:03] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:887858|Revert "Start reading from rev_comment_id in cebwiki"]] (duration: 09m 04s)