[00:00:25] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:41] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:09] (03PS1) 10Cwhite: Remove non-kafka logstash nodes from kafka configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/886862 (https://phabricator.wikimedia.org/T329142) [00:09:53] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2023-01-31 00:00:13 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:12:08] (03PS1) 10Cwhite: logstash: enable error.stack.previous_trace [puppet] - 10https://gerrit.wikimedia.org/r/886863 (https://phabricator.wikimedia.org/T314098) [00:13:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T328817)', diff saved to https://phabricator.wikimedia.org/P43919 and previous config saved to /var/cache/conftool/dbconfig/20230209-001340-marostegui.json [00:13:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [00:13:44] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [00:13:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [00:14:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T328817)', diff saved to https://phabricator.wikimedia.org/P43920 and previous config saved to /var/cache/conftool/dbconfig/20230209-001401-marostegui.json [00:14:17] (03CR) 10CI reject: [V: 04-1] logstash: enable error.stack.previous_trace [puppet] - 
10https://gerrit.wikimedia.org/r/886863 (https://phabricator.wikimedia.org/T314098) (owner: 10Cwhite) [00:16:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T328817)', diff saved to https://phabricator.wikimedia.org/P43921 and previous config saved to /var/cache/conftool/dbconfig/20230209-001613-marostegui.json [00:17:39] (03PS2) 10Cwhite: logstash: enable error.stack.previous_trace [puppet] - 10https://gerrit.wikimedia.org/r/886863 (https://phabricator.wikimedia.org/T314098) [00:18:41] (03CR) 10Andrew Bogott: [C: 03+2] Backy2 backup jobs: don't email on failure [puppet] - 10https://gerrit.wikimedia.org/r/886470 (https://phabricator.wikimedia.org/T328868) (owner: 10Andrew Bogott) [00:19:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43922 and previous config saved to /var/cache/conftool/dbconfig/20230209-001910-ladsgroup.json [00:19:13] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [00:22:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:22:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2423.codfw.wmnet with OS buster [00:22:20] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2423.codfw.wmnet with OS buster completed: - mw2423 (**PASS**) - Removed from Pupp... 
[00:24:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2424.codfw.wmnet with OS buster [00:24:57] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2424.codfw.wmnet with OS buster [00:27:25] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [00:31:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P43923 and previous config saved to /var/cache/conftool/dbconfig/20230209-003119-marostegui.json [00:34:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P43924 and previous config saved to /var/cache/conftool/dbconfig/20230209-003416-ladsgroup.json [00:40:39] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2023-02-07 15:56:09 (4056 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:41:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2425.codfw.wmnet with OS buster [00:41:12] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2425.codfw.wmnet with OS buster [00:46:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P43925 and previous config saved to /var/cache/conftool/dbconfig/20230209-004625-marostegui.json [00:49:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to 
https://phabricator.wikimedia.org/P43926 and previous config saved to /var/cache/conftool/dbconfig/20230209-004923-ladsgroup.json [00:50:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2424.codfw.wmnet with reason: host reimage [00:53:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2424.codfw.wmnet with reason: host reimage [01:00:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2425.codfw.wmnet with reason: host reimage [01:01:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T328817)', diff saved to https://phabricator.wikimedia.org/P43927 and previous config saved to /var/cache/conftool/dbconfig/20230209-010132-marostegui.json [01:01:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:01:36] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [01:01:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:03:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2425.codfw.wmnet with reason: host reimage [01:04:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P43928 and previous config saved to /var/cache/conftool/dbconfig/20230209-010429-ladsgroup.json [01:04:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [01:04:33] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [01:04:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance 
[01:04:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43929 and previous config saved to /var/cache/conftool/dbconfig/20230209-010450-ladsgroup.json [01:09:29] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:17:25] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:22:34] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:22:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2425.codfw.wmnet with OS buster [01:22:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:22:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2424.codfw.wmnet with OS buster [01:22:42] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2425.codfw.wmnet with OS buster completed: - mw2425 (**PASS**) - Removed from Pupp... [01:22:45] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2424.codfw.wmnet with OS buster completed: - mw2424 (**PASS**) - Removed from Pupp... 
[01:23:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2426.codfw.wmnet with OS buster [01:23:19] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2426.codfw.wmnet with OS buster [01:27:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2427.codfw.wmnet with OS buster [01:28:01] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2427.codfw.wmnet with OS buster [01:36:36] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [01:42:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2426.codfw.wmnet with reason: host reimage [01:45:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2426.codfw.wmnet with reason: host reimage [01:47:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2427.codfw.wmnet with reason: host reimage [01:47:54] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [01:48:58] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [01:50:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2427.codfw.wmnet with reason: host reimage [02:00:16] !log pt1979@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [02:04:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43930 and previous config saved to /var/cache/conftool/dbconfig/20230209-020401-ladsgroup.json [02:04:04] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [02:04:59] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [02:10:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [02:10:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2426.codfw.wmnet with OS buster [02:10:13] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2426.codfw.wmnet with OS buster completed: - mw2426 (**PASS**) - Removed from Pupp... 
[02:10:46] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [02:11:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2427.codfw.wmnet with OS buster [02:11:13] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2427.codfw.wmnet with OS buster completed: - mw2427 (**PASS**) - Removed from Pupp... [02:11:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2428.codfw.wmnet with OS buster [02:11:36] could I get a deployer to run a quick and harmless maintenance script on zhwiki for me? or should that go through a backport window? 
(there's no patch) [02:11:37] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2428.codfw.wmnet with OS buster [02:11:58] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul) [02:17:04] musikanimal: depends on the script :D [02:18:42] TheresNoTime: there was a botched deploy of PageAssessments to zhwiki [02:18:50] https://phabricator.wikimedia.org/T328224 for context [02:19:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43931 and previous config saved to /var/cache/conftool/dbconfig/20230209-021907-ladsgroup.json [02:19:30] looking.. [02:20:09] we need the purgeUnusedProjects.php maintenance script to be run [02:20:46] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:54] I removed the page assessments, and that table is now (almost) empty... but anyway the page_assessments_projects rows are the corrupt data, and running purgeUnusedProjects.php should clear those out [02:21:12] okay, one moment :) [02:21:51] I have prod db access FYI, just read-only. 
`SELECT COUNT(*) FROM page_assessments_projects` reports 479, that should be zero (or very close to zero) [02:23:37] !log `[samtar@mwmaint1002 ~]$ mwscript extensions/PageAssessments/maintenance/purgeUnusedProjects.php --wiki zhwiki --dry-run` for T326387 [02:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:40] T326387: Deploy PageAssessments to Chinese Wikipedia - https://phabricator.wikimedia.org/T326387 [02:24:36] musikanimal: does https://phabricator.wikimedia.org/P43932 look reasonable? [02:25:11] yep! though I wonder if you're able to TRUNCATE `page_assessments` first? then it would actually be zero [02:25:43] apparently there's a bug in PageAssessments where pages moved without redirect leave the assessments behind, so there's 7 rows in `page_assessments` for non-existent pages, and so the maintenance script thinks those WikiProjects are being used [02:26:21] doing direct db writes/deletes etc seems scary so no worries if you don't want to or can't [02:26:32] there will only be a few rows of bad data, no big deal :) [02:26:54] musikanimal: okay, I am `sql zhwiki`, and I am going to run `TRUNCATE TABLE page_assessments;`, correct? [02:27:04] It won't work like that [02:27:11] You're connected to a replica [02:27:21] ah [02:27:40] then I'm going to run the maintenance script and we can worry about those 7 later, sound okay musikanimal? 
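The leftover-row problem described above (rows in `page_assessments` that point at pages which no longer exist) can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database as a stand-in for the real MariaDB, and the table layout and column names (`pa_page_id`, `pa_project_id`, `page_id`) are assumptions for illustration, not a copy of the production schema. The idea is to list the orphaned rows first and then delete only those, instead of emptying the whole table:

```python
import sqlite3


def find_orphans(conn):
    """Return (pa_page_id, pa_project_id) pairs whose page no longer exists."""
    return conn.execute(
        """
        SELECT pa.pa_page_id, pa.pa_project_id
        FROM page_assessments AS pa
        LEFT JOIN page AS p ON p.page_id = pa.pa_page_id
        WHERE p.page_id IS NULL
        ORDER BY pa.pa_page_id, pa.pa_project_id
        """
    ).fetchall()


def delete_orphans(conn, orphans):
    """Delete only the listed rows and return how many were removed."""
    cur = conn.executemany(
        "DELETE FROM page_assessments WHERE pa_page_id = ? AND pa_project_id = ?",
        orphans,
    )
    conn.commit()
    return cur.rowcount


def demo():
    # Hypothetical toy data: pages 1 and 2 exist, page 99 does not.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY);
        CREATE TABLE page_assessments (pa_page_id INTEGER, pa_project_id INTEGER);
        INSERT INTO page VALUES (1), (2);
        INSERT INTO page_assessments VALUES (1, 10), (2, 10), (99, 11);
        """
    )
    orphans = find_orphans(conn)
    removed = delete_orphans(conn, orphans)
    remaining = conn.execute("SELECT COUNT(*) FROM page_assessments").fetchone()[0]
    return orphans, removed, remaining
```

As pointed out in the exchange above, a write like this would not work against a replica; on production it would have to go to the primary.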
[02:27:53] you can do it via connecting to the master ;P [02:28:29] !log `[samtar@mwmaint1002 ~]$ mwscript extensions/PageAssessments/maintenance/purgeUnusedProjects.php --wiki zhwiki` for T326387 [02:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:04] that's fine, I know which rows should be removed and we can remove them later (or not) [02:30:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2428.codfw.wmnet with reason: host reimage [02:30:46] okay yeah, those 7 leftover rows in page_assessments_projects are not task forces (sub-WikiProjects), so it doesn't matter anyway. We're all set! [02:30:48] thank you!! [02:30:54] No worries :) [02:31:07] now I can add back the parser function then things should be stored correctly [02:33:05] Reedy: running a `TRUNCATE` is scary enough tyvm :p [02:33:46] if there's only 7 rows, you could do delete from table where id in [ list ]; [02:33:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2428.codfw.wmnet with reason: host reimage [02:34:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P43933 and previous config saved to /var/cache/conftool/dbconfig/20230209-023413-ladsgroup.json [02:34:29] lol, I agree TRUNCATE is scarrrry! [02:36:13] (03PS1) 10Raymond Ndibe: puppet: adapt replica_cnf_api to python3.5 [puppet] - 10https://gerrit.wikimedia.org/r/887872 (https://phabricator.wikimedia.org/T304040) [02:36:58] so interesting thing I've been wondering about... the API says there are zero jobs in the queue on zhwiki, but for sure there's about 800K+ that just got fired off after the template I just edited [02:37:20] why is that, and where should I go to see the actual number of pending jobs? [02:37:24] I think general advice is to ignore what the API says for the job queue size [02:38:51] haha ok. 
I guess I can query `job` directly [02:40:04] https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue#Monitoring [02:40:28] https://logstash.wikimedia.org/goto/ccd5e2517591489ad88bc66922f8311c being a dead link, *chef's kiss* [02:40:50] heh [02:40:51] yeah [02:41:01] !bug 1 [02:41:02] https://bugzilla.wikimedia.org/show_bug.cgi?id=1 [02:41:11] well they're not in the `job` table, I just queried and that is in fact 0 rows [02:41:39] haha!! nice one Reedy, I'm going to have to remember that [02:41:40] yeah, WMF production hasn't used the job table in a looong time [02:41:56] I'm guessing that's what the API is reporting? [02:42:08] I think it can report *some* other sources [02:43:19] I *think* it's meant to be https://logstash.wikimedia.org/goto/d6d10e8e40672fcca72e3e556b7af954 ? [02:45:32] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue and https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1 [02:45:41] It's hard to know what old links are actually supposed to be pointing to [02:48:08] well I fixed the logstash link in https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue#Logs (but left the old one in, just in case..) [02:48:22] TheresNoTime: If only the pages had history... 
:P [02:48:49] * TheresNoTime mutters [02:49:05] (03CR) 10Raymond Ndibe: "all tests passing both unit tests and functional tests on dbusers-nfs-1.testlabs.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/887872 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [02:49:12] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:49:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P43934 and previous config saved to /var/cache/conftool/dbconfig/20230209-024920-ladsgroup.json [02:49:23] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [02:50:26] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [02:54:36] (03CR) 10Raymond Ndibe: puppet: modify role::wmcs::nfs::primary for replica_cnf api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [02:56:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [02:56:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2428.codfw.wmnet with OS buster [02:56:55] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2428.codfw.wmnet with OS buster completed: - mw2428 (**PASS**) - Removed from Pupp... 
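On the job-queue question raised earlier in the log: assuming the figure in question is the `jobs` field returned by `action=query&meta=siteinfo&siprop=statistics`, a minimal sketch of building that request and reading the field could look like this. The helper names and the sample response are made up for illustration, and no network call is made here.

```python
from urllib.parse import urlencode


def jobs_query_url(api_base):
    """Build the siteinfo statistics request URL for a given api.php endpoint."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


def reported_jobs(response):
    """Pull the reported job count out of a decoded JSON response."""
    return response["query"]["statistics"]["jobs"]


# Hypothetical response shape, trimmed to the one field of interest.
sample = {"query": {"statistics": {"jobs": 0}}}
url = jobs_query_url("https://zh.wikipedia.org/w/api.php")
count = reported_jobs(sample)
```

Per the advice given in the channel, this number should not be treated as the real queue size on WMF production; the Grafana job-queue dashboards linked above are the place to look.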
[03:12:48] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul) [03:47:19] SRE, Commons, MediaWiki-File-management, StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (Samwilson) Related Community Wishlist Survey proposal: [[https://meta.wikimedia.org/wiki... [04:44:14] (PS1) KartikMistry: CX: Provide the appropriate arguments to ve.ui.CXSurface constructor [extensions/ContentTranslation] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887847 (https://phabricator.wikimedia.org/T329154) [05:10:25] (CR) Legoktm: "Overall looks fine, my two comments aren't blockers, just suggestions." [mediawiki-config] - https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: Urbanecm) [05:10:42] urbanecm: hope that helps [05:56:03] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:56:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:57:07] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:01:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:01:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:02:21] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT 
+0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:32:11] I am switching over phabricator master in 30 minutes, meaning 1 minute of read only time [06:38:29] (PS2) Marostegui: mariadb: Promote db1159 to m3 mater [puppet] - https://gerrit.wikimedia.org/r/887727 (https://phabricator.wikimedia.org/T329141) [06:40:50] (CR) Marostegui: [C: +2] mariadb: Promote db1159 to m3 mater [puppet] - https://gerrit.wikimedia.org/r/887727 (https://phabricator.wikimedia.org/T329141) (owner: Marostegui) [06:48:09] !log oblivian@cumin2002 START - Cookbook sre.discovery.datacenter status all services in eqiad: maintenance [06:48:17] !log oblivian@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in eqiad: maintenance [06:54:44] (CR) Giuseppe Lavagetto: sre.discovery.datacenter: rename and add status command (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/887740 (owner: Giuseppe Lavagetto) [06:55:35] (PS4) Giuseppe Lavagetto: sre.discovery.datacenter: rename and add status command [cookbooks] - https://gerrit.wikimedia.org/r/887740 [07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T0700) [07:00:04] kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T0700). 
[07:00:06] !log Failover m3 from db1164 to db1159 - T329141 [07:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:09] T329141: Switchover m3 master db1164 -> db1159 - https://phabricator.wikimedia.org/T329141 [07:02:47] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 9 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) [07:03:29] (03PS1) 10Marostegui: db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/887877 (https://phabricator.wikimedia.org/T329143) [07:04:02] (03CR) 10Marostegui: [C: 03+2] db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/887877 (https://phabricator.wikimedia.org/T329143) (owner: 10Marostegui) [07:04:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance [07:04:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance [07:09:07] (03PS1) 10Marostegui: db1098: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/887878 (https://phabricator.wikimedia.org/T329171) [07:09:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [07:09:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [07:09:48] (03CR) 10Marostegui: [C: 03+2] db1098: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/887878 (https://phabricator.wikimedia.org/T329171) (owner: 10Marostegui) [07:10:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1098 (s6, s7) from dbctl T329171', diff saved to https://phabricator.wikimedia.org/P43935 and previous config saved to /var/cache/conftool/dbconfig/20230209-071013-marostegui.json [07:10:17] T329171: decommission db1098.eqiad.wmnet - 
https://phabricator.wikimedia.org/T329171 [07:18:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [07:19:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [07:19:49] (03PS1) 10Marostegui: db1098: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/887879 (https://phabricator.wikimedia.org/T329171) [07:20:59] 10ops-codfw, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Marostegui) >>! In T328623#8594346, @Jhancock.wm wrote: > We did some more troubleshooting and it looks like the slot for DIMM_B4 is bad. This may need a MB replacement to fully fix. Thanks - just let me know whe... [07:21:11] (03CR) 10Marostegui: [C: 03+2] db1098: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/887879 (https://phabricator.wikimedia.org/T329171) (owner: 10Marostegui) [07:21:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [07:21:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [07:22:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T328817)', diff saved to https://phabricator.wikimedia.org/P43936 and previous config saved to /var/cache/conftool/dbconfig/20230209-072204-marostegui.json [07:22:07] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [07:23:55] (03PS1) 10Marostegui: mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/887880 (https://phabricator.wikimedia.org/T329143) [07:24:40] !log Stop mariadb on db1117:3321 (some dbproxy irc alerts will be triggered) T329143 [07:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:43] T329143: 
Move db1164 to m1 - https://phabricator.wikimedia.org/T329143 [07:25:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T328817)', diff saved to https://phabricator.wikimedia.org/P43938 and previous config saved to /var/cache/conftool/dbconfig/20230209-072535-marostegui.json [07:26:46] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/887880 (https://phabricator.wikimedia.org/T329143) (owner: 10Marostegui) [07:33:41] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:40:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P43939 and previous config saved to /var/cache/conftool/dbconfig/20230209-074042-marostegui.json [07:45:17] Again, dbproxy irc alerts are expected [07:48:28] (03CR) 10Elukey: "Ben I left a comment about a setting that may cause a runtime error from puppet, lemme know what you think. My knowledge about profile::ha" [puppet] - 10https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) (owner: 10Btullis) [07:52:11] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:55:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P43940 and previous config saved to /var/cache/conftool/dbconfig/20230209-075548-marostegui.json [08:00:04] Amir1, apergos, and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T0800). [08:00:04] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:28] morning! woops, forgot to check the deployments calendar, give me one sec [08:00:46] * kart_ is here [08:00:56] no trainees signed up for the slot [08:01:12] kart_: care to self-deploy? I know you're usually good for it [08:01:12] apergos: I guess, I can go ahead.. [08:01:25] apergos: yes :) [08:01:45] okey dokey, it's all you! [08:02:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887847 (https://phabricator.wikimedia.org/T329154) (owner: 10KartikMistry) [08:06:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/887743 (owner: 10Elukey) [08:09:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [08:10:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T328817)', diff saved to https://phabricator.wikimedia.org/P43941 and previous config saved to /var/cache/conftool/dbconfig/20230209-081054-marostegui.json [08:10:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [08:10:58] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [08:11:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [08:11:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T328817)', diff saved to https://phabricator.wikimedia.org/P43942 and previous config saved to /var/cache/conftool/dbconfig/20230209-081116-marostegui.json [08:14:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 
(T328817)', diff saved to https://phabricator.wikimedia.org/P43943 and previous config saved to /var/cache/conftool/dbconfig/20230209-081433-marostegui.json [08:17:10] (03CR) 10Muehlenhoff: add SPDX license headers to various roles I was involved in writing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887382 (owner: 10Dzahn) [08:18:35] (03Merged) 10jenkins-bot: CX: Provide the appropriate arguments to ve.ui.CXSurface constructor [extensions/ContentTranslation] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887847 (https://phabricator.wikimedia.org/T329154) (owner: 10KartikMistry) [08:19:04] !log kartik@deploy1002 Started scap: Backport for [[gerrit:887847|CX: Provide the appropriate arguments to ve.ui.CXSurface constructor (T329154)]] [08:19:07] T329154: Content Translation is broken in test wiki - https://phabricator.wikimedia.org/T329154 [08:20:15] (03PS24) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [08:20:23] (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [08:20:59] !log kartik@deploy1002 kartik: Backport for [[gerrit:887847|CX: Provide the appropriate arguments to ve.ui.CXSurface constructor (T329154)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [08:24:05] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:24:15] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:25:45] 10ops-codfw, 10DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (10Marostegui) [08:29:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to 
https://phabricator.wikimedia.org/P43944 and previous config saved to /var/cache/conftool/dbconfig/20230209-082940-marostegui.json [08:29:47] (03PS1) 10Vgutierrez: cp4044: Enable ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/887882 (https://phabricator.wikimedia.org/T308799) [08:31:57] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/886331 (owner: 10Slyngshede) [08:32:09] how's it looking? [08:32:14] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:887847|CX: Provide the appropriate arguments to ve.ui.CXSurface constructor (T329154)]] (duration: 13m 10s) [08:32:17] T329154: Content Translation is broken in test wiki - https://phabricator.wikimedia.org/T329154 [08:34:57] /buffer 6 [08:41:06] kart_: ? how are things? [08:41:33] apergos: all done. Sorry for delay. [08:41:46] ok! no worries, that's the window for today then [08:42:10] !log UTC morning backport and config training window complete [08:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:30] see everyone here again next time! [08:44:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P43945 and previous config saved to /var/cache/conftool/dbconfig/20230209-084446-marostegui.json [08:46:09] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:40] (03PS1) 10Marostegui: add_cuc_only_for_read_old_T329203.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887884 (https://phabricator.wikimedia.org/T329203) [08:55:51] apergos: can I backport something now? 
[08:55:54] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/887843 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney) [08:56:00] I can also wait until afternoon window [08:56:47] jouncebot, next [08:56:48] In 2 hour(s) and 3 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100) [08:56:48] In 2 hour(s) and 3 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100) [08:57:24] I am not them, but since there is nothing after this window for 2 hours you could just deploy somewhere in that time period [08:57:53] !log depool cp4044 - T308799 [08:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:57] T308799: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 [08:58:24] zabe: ok, I'll get started then [08:58:48] (03CR) 10Vgutierrez: [C: 03+2] cp4044: Enable ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/887882 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [08:58:58] (03PS1) 10Marostegui: monitoring.yaml: Replace m1 master [puppet] - 10https://gerrit.wikimedia.org/r/887885 (https://phabricator.wikimedia.org/T329259) [08:59:11] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/887885 (https://phabricator.wikimedia.org/T329259) (owner: 10Marostegui) [08:59:14] (03CR) 10Vgutierrez: cp4044: Enable ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/887882 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [08:59:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T328817)', diff saved to https://phabricator.wikimedia.org/P43946 and previous config saved to /var/cache/conftool/dbconfig/20230209-085952-marostegui.json [08:59:55] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [08:59:57] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [09:00:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [09:00:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [09:00:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [09:00:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T328817)', diff saved to https://phabricator.wikimedia.org/P43947 and previous config saved to /var/cache/conftool/dbconfig/20230209-090018-marostegui.json [09:00:52] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39477/console" [puppet] - 10https://gerrit.wikimedia.org/r/887882 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [09:01:53] (03PS1) 10Kosta Harlan: ComputedUserImpactLookup: Reduce logspam for page view rate limiting [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945) [09:02:12] (03PS1) 10Kosta Harlan: Add StatusValue::hasMessagesExcept() [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887849 (https://phabricator.wikimedia.org/T272081) [09:02:20] (03CR) 10CI reject: [V: 04-1] ComputedUserImpactLookup: Reduce logspam for page view rate limiting [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945) (owner: 10Kosta Harlan) [09:02:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T328817)', diff saved to 
https://phabricator.wikimedia.org/P43948 and previous config saved to /var/cache/conftool/dbconfig/20230209-090236-marostegui.json [09:02:46] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cp4044: Enable ESI testing [puppet] - 10https://gerrit.wikimedia.org/r/887882 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [09:03:01] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto) [09:04:08] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945) (owner: 10Kosta Harlan) [09:04:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/887806 (https://phabricator.wikimedia.org/T329175) (owner: 10Giuseppe Lavagetto) [09:04:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887849 (https://phabricator.wikimedia.org/T272081) (owner: 10Kosta Harlan) [09:05:25] (03CR) 10Muehlenhoff: [C: 03+2] Update cloud proxies [puppet] - 10https://gerrit.wikimedia.org/r/887798 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [09:07:17] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Arclamp [puppet] - 10https://gerrit.wikimedia.org/r/887769 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:07:36] (03CR) 10Hashar: [C: 04-1] "On the devtools project, I have rebooted our testing Phabricator instance phabricator-prod-1001 and confirmed phd failed to start." 
[puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [09:08:34] (03PS10) 10Hashar: phabricator: create phd home directory on service start [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) [09:08:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 db1132', diff saved to https://phabricator.wikimedia.org/P43949 and previous config saved to /var/cache/conftool/dbconfig/20230209-090846-root.json [09:09:22] !log pool cp4044 with ESI testing enabled [09:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:34] !log Install 10.6.12 on db1132 T329011 [09:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:37] T329011: Compile and package MariaDB 10.4.28 and 10.6.12 - https://phabricator.wikimedia.org/T329011 [09:10:42] !log Install 10.4.28 on db1107 T329011 [09:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P43950 and previous config saved to /var/cache/conftool/dbconfig/20230209-091145-root.json [09:11:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P43951 and previous config saved to /var/cache/conftool/dbconfig/20230209-091149-root.json [09:13:28] !log installing openssl security updates on Bullseye [09:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P43952 and previous config saved to /var/cache/conftool/dbconfig/20230209-091742-marostegui.json [09:18:50] (03PS2) 10Jcrespo: dbbackups: Replace m1 master [puppet] - 
10https://gerrit.wikimedia.org/r/887885 (https://phabricator.wikimedia.org/T329259) (owner: 10Marostegui) [09:20:04] (03Merged) 10jenkins-bot: Add StatusValue::hasMessagesExcept() [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887849 (https://phabricator.wikimedia.org/T272081) (owner: 10Kosta Harlan) [09:20:13] (03CR) 10Filippo Giunchedi: [C: 03+1] Upgrade plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/886861 (https://phabricator.wikimedia.org/T317887) (owner: 10Cwhite) [09:20:31] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:887849|Add StatusValue::hasMessagesExcept() (T272081)]] [09:20:35] T272081: Introduce StatusValue::ignore method - https://phabricator.wikimedia.org/T272081 [09:20:35] (03CR) 10Hashar: [C: 03+1] "I made the RuntimeDirectory relative. Applied the patch on the puppetmaster, rebooted the instance and this time it works with `/var/run/p" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [09:22:23] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:887849|Add StatusValue::hasMessagesExcept() (T272081)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [09:22:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "A bit more of tech-debt but Good Enough™, and probably better than forcing statsd.eqiad.wmnet to v4" [puppet] - 10https://gerrit.wikimedia.org/r/887804 (owner: 10Herron) [09:24:47] (03PS1) 10Vgutierrez: Revert "cp4044: Enable ESI testing" [puppet] - 10https://gerrit.wikimedia.org/r/887850 [09:25:13] (03PS2) 10Vgutierrez: Revert "cp4044: Enable ESI testing" [puppet] - 10https://gerrit.wikimedia.org/r/887850 (https://phabricator.wikimedia.org/T308799) [09:25:49] (03CR) 10Vgutierrez: [C: 03+2] Revert "cp4044: Enable ESI testing" [puppet] - 10https://gerrit.wikimedia.org/r/887850 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [09:26:10] (03PS2) 
10Filippo Giunchedi: opensearch_dashboards: enforce memory limit [puppet] - 10https://gerrit.wikimedia.org/r/887767 (https://phabricator.wikimedia.org/T327161) [09:26:15] (03CR) 10Filippo Giunchedi: "Thank you for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/887767 (https://phabricator.wikimedia.org/T327161) (owner: 10Filippo Giunchedi) [09:26:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43953 and previous config saved to /var/cache/conftool/dbconfig/20230209-092650-root.json [09:26:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43954 and previous config saved to /var/cache/conftool/dbconfig/20230209-092654-root.json [09:27:03] (03CR) 10Filippo Giunchedi: [C: 03+2] opensearch_dashboards: enforce memory limit [puppet] - 10https://gerrit.wikimedia.org/r/887767 (https://phabricator.wikimedia.org/T327161) (owner: 10Filippo Giunchedi) [09:27:07] (03CR) 10Elukey: [C: 03+2] cumin: add more aliases for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/887743 (owner: 10Elukey) [09:27:15] (03PS6) 10Elukey: cumin: add more aliases for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/887743 [09:27:20] (03CR) 10Elukey: [V: 03+2] cumin: add more aliases for the ml-staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/887743 (owner: 10Elukey) [09:28:51] (03PS1) 10Marostegui: control-mariadb-client-10.4: Update version [software] - 10https://gerrit.wikimedia.org/r/887935 (https://phabricator.wikimedia.org/T329011) [09:29:52] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:887849|Add StatusValue::hasMessagesExcept() (T272081)]] (duration: 09m 20s) [09:29:55] T272081: Introduce StatusValue::ignore method - https://phabricator.wikimedia.org/T272081 [09:30:08] 10SRE, 10Data-Persistence, 10Discovery-Search, 10serviceops, and 2 others: 
March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Clement_Goubert) [09:31:29] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-10.4: Update version [software] - 10https://gerrit.wikimedia.org/r/887935 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui) [09:31:39] on to the next one [09:32:20] (03Merged) 10jenkins-bot: control-mariadb-client-10.4: Update version [software] - 10https://gerrit.wikimedia.org/r/887935 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui) [09:32:31] !log roll-restart opensearch-dashboards to apply memory limit - T327161 [09:32:32] (03PS2) 10Kosta Harlan: ComputedUserImpactLookup: Reduce logspam for page view rate limiting [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945) [09:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:34] T327161: opensearch OOM on logstash102[34] - https://phabricator.wikimedia.org/T327161 [09:32:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945) (owner: 10Kosta Harlan) [09:32:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P43955 and previous config saved to /var/cache/conftool/dbconfig/20230209-093248-marostegui.json [09:32:58] I had to remove the Depends-On in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/887848/ for the patch to work with scap backport [09:34:51] (03CR) 10Elukey: "Testing the code on cumin1001 with Dry-run, fixing little bugs and then report back." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:36:47] (03PS3) 10FNegri: Add support for cloud test env (codfw) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 [09:37:08] (03PS1) 10Marostegui: control-mariadb-10.6-bullseye: Update to 10.6.12 [software] - 10https://gerrit.wikimedia.org/r/887936 (https://phabricator.wikimedia.org/T329011) [09:37:46] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bullseye: Update to 10.6.12 [software] - 10https://gerrit.wikimedia.org/r/887936 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui) [09:38:18] (03Merged) 10jenkins-bot: control-mariadb-10.6-bullseye: Update to 10.6.12 [software] - 10https://gerrit.wikimedia.org/r/887936 (https://phabricator.wikimedia.org/T329011) (owner: 10Marostegui) [09:40:14] (03PS1) 10Filippo Giunchedi: admin: move kwakuofori to ops [puppet] - 10https://gerrit.wikimedia.org/r/887937 (https://phabricator.wikimedia.org/T328787) [09:41:19] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Clement_Goubert) I'll let @LSobanski answer authoritatively for Phabricator and Etherpad. We are not switching over... 
[09:41:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43956 and previous config saved to /var/cache/conftool/dbconfig/20230209-094154-root.json [09:42:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43957 and previous config saved to /var/cache/conftool/dbconfig/20230209-094159-root.json [09:42:35] 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Traffic, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10nfraison) False alert has still been reported today in (Var... [09:43:28] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto) [09:43:41] (03PS4) 10Clément Goubert: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 [09:45:07] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM (if the key has been validated via some out-of-band channel)" [puppet] - 10https://gerrit.wikimedia.org/r/887937 (https://phabricator.wikimedia.org/T328787) (owner: 10Filippo Giunchedi) [09:47:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T328817)', diff saved to https://phabricator.wikimedia.org/P43958 and previous config saved to /var/cache/conftool/dbconfig/20230209-094755-marostegui.json [09:47:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [09:47:59] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [09:48:08] (03PS1) 10Muehlenhoff: Fix cloudvirt-codfw1dev Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/887939 [09:48:10] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [09:48:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P43959 and previous config saved to /var/cache/conftool/dbconfig/20230209-094816-marostegui.json [09:48:35] (03CR) 10Filippo Giunchedi: "Thank you for the quick review, I'll validate the key with Kwaku later today and then merge" [puppet] - 10https://gerrit.wikimedia.org/r/887937 (https://phabricator.wikimedia.org/T328787) (owner: 10Filippo Giunchedi) [09:49:36] (03Merged) 10jenkins-bot: ComputedUserImpactLookup: Reduce logspam for page view rate limiting [extensions/GrowthExperiments] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887848 (https://phabricator.wikimedia.org/T328945) (owner: 10Kosta Harlan) [09:49:59] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:887848|ComputedUserImpactLookup: Reduce logspam for page view rate limiting (T328945)]] [09:50:03] T328945: An earlier attempt to fetch page {page title} failed. To limit server load, retries have been blocked for 30 minutes. 
- https://phabricator.wikimedia.org/T328945 [09:50:10] (03PS5) 10Clément Goubert: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 [09:51:38] (03CR) 10Clément Goubert: [C: 03+1] sre.discovery.datacenter: fix rollback logic [cookbooks] - 10https://gerrit.wikimedia.org/r/887806 (https://phabricator.wikimedia.org/T329175) (owner: 10Giuseppe Lavagetto) [09:51:50] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:887848|ComputedUserImpactLookup: Reduce logspam for page view rate limiting (T328945)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [09:51:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P43960 and previous config saved to /var/cache/conftool/dbconfig/20230209-095153-marostegui.json [09:53:39] (03PS5) 10Clément Goubert: sre.discovery.datacenter: rename and add status command [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto) [09:53:41] (03PS6) 10Clément Goubert: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 [09:54:57] (03PS25) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [09:56:15] (03PS4) 10FNegri: Add support for cloud test env (codfw) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 [09:56:18] (03CR) 10Volans: [C: 03+1] "This was a hard one because of the diff 😊" [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond) [09:57:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43961 and previous config saved to /var/cache/conftool/dbconfig/20230209-095659-root.json [09:57:04] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db1107 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43962 and previous config saved to /var/cache/conftool/dbconfig/20230209-095704-root.json [09:58:06] (03CR) 10Elukey: "Ready for another review :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:59:06] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:887848|ComputedUserImpactLookup: Reduce logspam for page view rate limiting (T328945)]] (duration: 09m 06s) [09:59:09] T328945: An earlier attempt to fetch page {page title} failed. To limit server load, retries have been blocked for 30 minutes. - https://phabricator.wikimedia.org/T328945 [09:59:52] (03CR) 10Muehlenhoff: [C: 03+2] Fix cloudvirt-codfw1dev Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/887939 (owner: 10Muehlenhoff) [10:01:34] !log UTC morning deploys really done [10:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:07] (03PS6) 10Clément Goubert: sre.discovery.datacenter: rename and add status command [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto) [10:03:41] (03CR) 10Clément Goubert: [C: 03+1] "Reverted to the state of PS4 after a git-review mishap. Still lgtm." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto) [10:03:47] (03CR) 10Jelto: jenkins: fix directory and restrict sudo rules to jenkins jars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [10:05:19] (03CR) 10Muehlenhoff: "Looks good to me, one suggestion inline related to the fingerprint validation" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [10:06:20] 10SRE, 10Observability-Alerting, 10observability: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869 (10fgiunchedi) AFAICS we can't customize/template the URL karma builds for that link [10:07:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P43963 and previous config saved to /var/cache/conftool/dbconfig/20230209-100700-marostegui.json [10:07:17] (03CR) 10Btullis: [C: 03+1] Update analytics data purge for webrequest_actor [puppet] - 10https://gerrit.wikimedia.org/r/887786 (https://phabricator.wikimedia.org/T324483) (owner: 10Joal) [10:10:17] PROBLEM - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@search.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43964 and previous config saved to /var/cache/conftool/dbconfig/20230209-101204-root.json [10:12:06] (03PS1) 10Muehlenhoff: Remove installserver role from install1003 [puppet] - 10https://gerrit.wikimedia.org/r/887941 (https://phabricator.wikimedia.org/T327867) [10:12:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43965 and 
previous config saved to /var/cache/conftool/dbconfig/20230209-101209-root.json [10:14:45] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [10:21:22] (03CR) 10Muehlenhoff: [C: 03+2] Remove installserver role from install1003 [puppet] - 10https://gerrit.wikimedia.org/r/887941 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [10:22:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P43966 and previous config saved to /var/cache/conftool/dbconfig/20230209-102206-marostegui.json [10:25:06] (03CR) 10Nicolas Fraison: [C: 03+1] Update analytics data purge for webrequest_actor [puppet] - 10https://gerrit.wikimedia.org/r/887786 (https://phabricator.wikimedia.org/T324483) (owner: 10Joal) [10:27:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43967 and previous config saved to /var/cache/conftool/dbconfig/20230209-102709-root.json [10:27:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43968 and previous config saved to /var/cache/conftool/dbconfig/20230209-102713-root.json [10:30:03] (03CR) 10Clément Goubert: [C: 04-1] Add jaeger-es-index-cleaner (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [10:31:01] !log joal@deploy1002 Started deploy [airflow-dags/analytics@2ab6564]: Analytics deploy for 3 druid jobs and webrequest_actor jobs [10:31:19] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@2ab6564]: Analytics deploy for 3 druid jobs and webrequest_actor jobs (duration: 00m 17s) [10:32:21] (03CR) 10Btullis: [C: 03+1] Update analytics 
data purge for webrequest_actor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887786 (https://phabricator.wikimedia.org/T324483) (owner: 10Joal) [10:34:03] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2052.codfw.wmnet with OS bullseye [10:34:14] (03CR) 10Ladsgroup: [C: 03+1] add_cuc_only_for_read_old_T329203.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887884 (https://phabricator.wikimedia.org/T329203) (owner: 10Marostegui) [10:34:16] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] Adds 'before' directive to docker::network in gitlab runner setup [puppet] - 10https://gerrit.wikimedia.org/r/887843 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney) [10:34:22] (03CR) 10Marostegui: [C: 03+2] add_cuc_only_for_read_old_T329203.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887884 (https://phabricator.wikimedia.org/T329203) (owner: 10Marostegui) [10:34:26] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS bullseye [10:34:46] (03Merged) 10jenkins-bot: add_cuc_only_for_read_old_T329203.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887884 (https://phabricator.wikimedia.org/T329203) (owner: 10Marostegui) [10:35:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:35:56] (03PS1) 10Majavah: apt::repository: use signed-by instead of apt-key [puppet] - 10https://gerrit.wikimedia.org/r/887943 [10:35:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:36:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T329203)', diff saved to https://phabricator.wikimedia.org/P43970 and previous config saved to /var/cache/conftool/dbconfig/20230209-103604-marostegui.json [10:36:08] T329203: Add new column 
cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:37:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P43971 and previous config saved to /var/cache/conftool/dbconfig/20230209-103712-marostegui.json [10:37:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:37:17] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [10:37:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:37:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T328817)', diff saved to https://phabricator.wikimedia.org/P43972 and previous config saved to /var/cache/conftool/dbconfig/20230209-103733-marostegui.json [10:38:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T329203)', diff saved to https://phabricator.wikimedia.org/P43973 and previous config saved to /var/cache/conftool/dbconfig/20230209-103819-marostegui.json [10:38:21] !log installing containerd security updates [10:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:42] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39478/console" [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [10:42:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T328817)', diff saved to https://phabricator.wikimedia.org/P43974 and previous config saved to /var/cache/conftool/dbconfig/20230209-104208-marostegui.json [10:42:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P43975 and previous config saved to /var/cache/conftool/dbconfig/20230209-104214-root.json [10:42:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43976 and previous config saved to /var/cache/conftool/dbconfig/20230209-104218-root.json [10:47:29] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (10MoritzMuehlenhoff) I've reset the Netbox status from Failed to Active. [10:48:03] (03CR) 10David Caro: [C: 03+1] "LGTM, let me know how it goes" [puppet] - 10https://gerrit.wikimedia.org/r/887789 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott) [10:48:53] (03CR) 10Volans: "much better! replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [10:50:21] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2052.codfw.wmnet with reason: host reimage [10:52:52] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2052.codfw.wmnet with reason: host reimage [10:53:01] 10SRE: add Hal Triedman (htriedman) to ops-l mailing list - https://phabricator.wikimedia.org/T329209 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Thank you for reaching out @Htriedman ! Sign up is self-service here (list owners will need to approve the request) https://lists.wikimedia.org/postorius/lis... 
[10:53:04] (03CR) 10JMeybohm: Add a spark-operator chart and helmfile configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [10:53:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P43977 and previous config saved to /var/cache/conftool/dbconfig/20230209-105325-marostegui.json [10:53:57] (03PS1) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 [10:55:35] !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab-runner1003.eqiad.wmnet with OS bullseye [10:57:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P43978 and previous config saved to /var/cache/conftool/dbconfig/20230209-105714-marostegui.json [10:57:58] (03PS5) 10FNegri: Add support for cloud test env (codfw) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 [10:58:17] (03CR) 10Filippo Giunchedi: opensearch: reverse-proxy access to opensearch API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [10:58:37] !log joal@deploy1002 Started deploy [airflow-dags/analytics@dff3f3b]: Fix analytics webrequest_actor_metrics_rollup sensor [10:58:47] (03CR) 10FNegri: Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [10:58:51] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@dff3f3b]: Fix analytics webrequest_actor_metrics_rollup sensor (duration: 00m 13s) [10:59:00] (03PS1) 10Nicolas Fraison: fix(varnishkafka): add alert duration of 5m to avoid false positive [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) [10:59:27] 
(03CR) 10CI reject: [V: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi) [10:59:38] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on puppetdb2003.codfw.wmnet with reason: master is being reimaged [10:59:42] (03PS1) 10Marostegui: drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) [10:59:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on puppetdb2003.codfw.wmnet with reason: master is being reimaged [10:59:54] (03PS2) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 [11:00:04] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b1f3dbef-467c-49de-8608-5ba564efbe81) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [11:00:05] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100). 
[11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100) [11:00:24] (03CR) 10Ayounsi: [C: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi) [11:01:36] (03CR) 10Ladsgroup: drop_cuc_comment_T329260.py: New schema change (032 comments) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui) [11:02:43] (03PS2) 10Marostegui: drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) [11:02:48] !log powercycle mc-gp1001 [11:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:01] (03CR) 10Marostegui: drop_cuc_comment_T329260.py: New schema change (032 comments) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui) [11:04:08] (03CR) 10Ladsgroup: drop_cuc_comment_T329260.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui) [11:04:59] 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Traffic, and 2 others: Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10nfraison) From those graph we can see that no requests have been received on the varnish which leads to no... 
[11:05:45] (03CR) 10Ladsgroup: drop_cuc_comment_T329260.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui) [11:05:48] (03CR) 10CI reject: [V: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi) [11:06:17] (03PS3) 10Marostegui: drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) [11:06:40] (03PS3) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 [11:06:42] (03CR) 10Marostegui: drop_cuc_comment_T329260.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui) [11:07:25] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage [11:08:12] (03PS1) 10Muehlenhoff: Reset puppetdb1003/2003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/887971 (https://phabricator.wikimedia.org/T321783) [11:08:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P43979 and previous config saved to /var/cache/conftool/dbconfig/20230209-110832-marostegui.json [11:08:36] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2052.codfw.wmnet with OS bullseye [11:09:53] 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Traffic, and 2 others: Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10nfraison) The drop is indeed due to a depool ` 09:09 pool cp4044 with ESI testing enabled... 
[11:10:33] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage [11:11:14] (03CR) 10Muehlenhoff: [C: 03+2] Reset puppetdb1003/2003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/887971 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [11:11:47] (03CR) 10CI reject: [V: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi) [11:12:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P43980 and previous config saved to /var/cache/conftool/dbconfig/20230209-111220-marostegui.json [11:13:46] (03PS26) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [11:14:13] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability, 10User-jbond: Netbox: use the netbox to upet sync to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond) [11:14:29] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 2 others: Netbox: use the netbox to upet sync to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond) p:05Triage→03Medium [11:15:09] (03CR) 10JMeybohm: Add sre.k8s.upgrade-cluster (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:16:01] (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:16:02] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability, and 2 others: Puppet: get data (row, rack, site, and other information) from Netbox - https://phabricator.wikimedia.org/T229397 (10jbond) 05Open→03Resolved This is now 
completed [11:16:05] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [11:16:21] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10taavi) [11:16:30] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 2 others: Netbox: use the netbox to upet sync to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond) We should also see if we can use the same scripts/data to opulate https://gerrit.wikimedia.org/r/c/opera... [11:16:43] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 2 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond) [11:17:40] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10cloud-services-team: Puppet class systemd needs to throw a more useful error - https://phabricator.wikimedia.org/T195553 (10taavi) 05Open→03Invalid Boldly closing as we're fully on systemd. 
[11:19:35] (03CR) 10Volans: Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [11:20:08] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc-gp1001.eqiad.wmnet with OS bullseye [11:20:58] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS bullseye [11:23:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T329203)', diff saved to https://phabricator.wikimedia.org/P43981 and previous config saved to /var/cache/conftool/dbconfig/20230209-112338-marostegui.json [11:23:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance [11:23:42] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [11:23:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance [11:23:56] (03PS27) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [11:24:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T329203)', diff saved to https://phabricator.wikimedia.org/P43982 and previous config saved to /var/cache/conftool/dbconfig/20230209-112359-marostegui.json [11:24:11] (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:25:12] 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Aklapper) @Dzahn: Did you have any luck getting a reply? 
[11:27:10] 10SRE, 10DNS, 10Infrastructure-Foundations: Reverse DNS missing for some hosts - https://phabricator.wikimedia.org/T251522 (10Aklapper) @Reedy: ping? [11:27:11] (03CR) 10Muehlenhoff: Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [11:27:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T328817)', diff saved to https://phabricator.wikimedia.org/P43983 and previous config saved to /var/cache/conftool/dbconfig/20230209-112727-marostegui.json [11:27:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [11:27:31] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [11:27:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [11:27:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [11:27:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P43984 and previous config saved to /var/cache/conftool/dbconfig/20230209-112748-marostegui.json [11:28:38] 10SRE, 10Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) [11:29:14] 10SRE, 10Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) All install servers are running Bullseye now, the only missing bit is to remove the old VMs. 
[11:29:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T329203)', diff saved to https://phabricator.wikimedia.org/P43985 and previous config saved to /var/cache/conftool/dbconfig/20230209-112927-marostegui.json [11:29:31] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [11:30:29] 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Sadads) You can close this ticket. Its been resolved. [11:31:07] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1003.eqiad.wmnet with OS bullseye [11:31:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P43986 and previous config saved to /var/cache/conftool/dbconfig/20230209-113125-marostegui.json [11:32:04] (03CR) 10Giuseppe Lavagetto: [WIP] Refactor and centralize BGPpeer config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi) [11:33:03] (03CR) 10Muehlenhoff: [C: 03+1] jenkins: fix directory and restrict sudo rules to jenkins jars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [11:34:21] !log Stop mariadb on db1098 (s6 and s7) T329171 [11:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:24] T329171: decommission db1098.eqiad.wmnet - https://phabricator.wikimedia.org/T329171 [11:38:43] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuration (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [11:39:17] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:40:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetdb1003.eqiad.wmnet with OS bullseye [11:41:05] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bullseye [11:42:09] (03PS1) 10Muehlenhoff: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 [11:42:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:44:12] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 (owner: 10Muehlenhoff) [11:44:17] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:44:34] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P43988 and previous config saved to /var/cache/conftool/dbconfig/20230209-114434-marostegui.json [11:45:43] (03PS1) 10Aklapper: Remove redirect for pk.wikimedia.org (Pakistan) [puppet] - 10https://gerrit.wikimedia.org/r/887980 (https://phabricator.wikimedia.org/T328596) [11:46:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P43989 and previous config saved to /var/cache/conftool/dbconfig/20230209-114632-marostegui.json [11:47:05] 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10FRomeo_WMF) 05Open→03Resolved The Google Group was set up (thanks) and we just refreshed the membership and management. [11:47:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:48:17] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:48:40] (03PS1) 10JMeybohm: k8s::package: Ensure the apt component is registered first [puppet] - 10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) [11:49:01] (03CR) 10CI reject: [V: 04-1] k8s::package: Ensure the apt component is registered first [puppet] - 
10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [11:50:56] (03PS1) 10Volans: Add Makefile.deploy for the deploy cookbook [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/887982 [11:51:24] (03PS1) 10EoghanGaffney: Try running docker before the base firewall rules are added [puppet] - 10https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035) [11:51:26] (03CR) 10Volans: "This can be compared with the one present in homer:" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/887982 (owner: 10Volans) [11:52:01] (03PS2) 10Ladsgroup: Migrate Babel config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887307 (https://phabricator.wikimedia.org/T308932) [11:52:05] !log jiji@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc-gp1001.eqiad.wmnet with OS bullseye [11:52:11] jouncebot: nowandnext [11:52:11] For the next 0 hour(s) and 7 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100) [11:52:11] For the next 0 hour(s) and 7 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1100) [11:52:11] In 2 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1400) [11:52:11] In 2 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1400) [11:52:29] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS bullseye [11:52:35] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39479/console" [puppet] - 10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [11:53:10] !log jmm@cumin2002 
START - Cookbook sre.hosts.downtime for 2:00:00 on puppetdb1003.eqiad.wmnet with reason: host reimage [11:53:34] (03CR) 10Ladsgroup: [C: 03+2] Migrate Babel config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887307 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [11:54:32] (03Merged) 10jenkins-bot: Migrate Babel config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887307 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [11:55:05] (03CR) 10Volans: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 (owner: 10Muehlenhoff) [11:55:11] 10SRE-tools, 10Discovery-Search, 10Elasticsearch, 10Infrastructure-Foundations, 10Spicerack: elasticsearch spicerack module failes with most recent elastic-curator - https://phabricator.wikimedia.org/T328775 (10jbond) >>! In T328775#8599015, @bking wrote: > Thanks @jbond ! Looking at the Spicerack chang... 
[11:55:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetdb1003.eqiad.wmnet with reason: host reimage [11:57:23] !log jiji@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc-gp1001.eqiad.wmnet with OS bullseye [11:57:41] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS bullseye [11:57:48] (03CR) 10FNegri: [C: 03+2] Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [11:58:08] (03CR) 10FNegri: [V: 03+2 C: 03+2] Add support for cloud test env (codfw) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [11:58:17] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:59:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P43990 and previous config saved to /var/cache/conftool/dbconfig/20230209-115940-marostegui.json [12:01:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P43991 and previous config saved to /var/cache/conftool/dbconfig/20230209-120138-marostegui.json [12:02:12] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on an-worker1098.eqiad.wmnet with reason: Attempting to move some GPUs [12:02:17] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:02:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-worker1098.eqiad.wmnet with reason: Attempting to move some GPUs [12:02:32] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on an-worker1099.eqiad.wmnet with reason: Attempting to move some GPUs [12:02:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=49b5d5ab-a254-46d1-b90a-001be80f1... [12:02:46] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-worker1099.eqiad.wmnet with reason: Attempting to move some GPUs [12:02:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2de0632f-155c-4404-88de-ffa2c986c... 
[12:03:08] !log ladsgroup@deploy1002 Synchronized wmf-config/ext-Babel.php: Move Babel settings from IS.php to ext-Babel.php, part I (T308932) (duration: 07m 06s) [12:03:12] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [12:06:29] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: Attempting to move some GPUs [12:06:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: Attempting to move some GPUs [12:06:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ac0799f5-49e4-45fd-99e4-a3048068d... [12:10:16] !log ladsgroup@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Move Babel settings from IS.php to ext-Babel.php, part II (T308932) (duration: 06m 40s) [12:10:19] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [12:10:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetdb1003.eqiad.wmnet with OS bullseye [12:10:38] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bullseye completed: - puppetd... 
[12:12:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetdb2003.codfw.wmnet with OS bullseye [12:12:46] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetdb2003.codfw.wmnet with OS bullseye [12:13:36] (03PS2) 10Muehlenhoff: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 [12:13:57] (03CR) 10Muehlenhoff: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 (owner: 10Muehlenhoff) [12:14:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T329203)', diff saved to https://phabricator.wikimedia.org/P43992 and previous config saved to /var/cache/conftool/dbconfig/20230209-121446-marostegui.json [12:14:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [12:14:50] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [12:15:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [12:15:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T329203)', diff saved to https://phabricator.wikimedia.org/P43993 and previous config saved to /var/cache/conftool/dbconfig/20230209-121507-marostegui.json [12:15:19] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 (owner: 10Muehlenhoff) [12:16:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 
(T328817)', diff saved to https://phabricator.wikimedia.org/P43994 and previous config saved to /var/cache/conftool/dbconfig/20230209-121644-marostegui.json [12:16:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [12:16:48] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [12:16:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [12:17:04] !log jiji@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc-gp1001.eqiad.wmnet with OS bullseye [12:17:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T328817)', diff saved to https://phabricator.wikimedia.org/P43995 and previous config saved to /var/cache/conftool/dbconfig/20230209-121705-marostegui.json [12:17:27] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Move Babel settings from IS.php to ext-Babel.php, part III (T308932) (duration: 06m 47s) [12:17:30] (03PS3) 10Muehlenhoff: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 [12:17:30] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [12:18:22] (03PS1) 10Hnowlan: api-gateway: reformat templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/887991 (https://phabricator.wikimedia.org/T329049) [12:18:48] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS buster [12:19:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T328817)', diff saved to https://phabricator.wikimedia.org/P43996 and previous config saved to /var/cache/conftool/dbconfig/20230209-121923-marostegui.json [12:19:41] PROBLEM - Check unit status 
of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:19:42] (03CR) 10Jbond: "fly by comment" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [12:19:48] (03CR) 10Hnowlan: "recheck" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 (owner: 10Hokwelum) [12:20:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T329203)', diff saved to https://phabricator.wikimedia.org/P43997 and previous config saved to /var/cache/conftool/dbconfig/20230209-122036-marostegui.json [12:20:40] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [12:21:29] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:17] !log phedenskog@deploy1002 Started deploy [performance/navtiming@bb224a1]: (no justification provided) [12:22:25] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@bb224a1]: (no justification provided) (duration: 00m 08s) [12:22:34] 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Traffic, and 2 others: Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10nfraison) I've looked back at the alerts we have faced on the 7th morning and those ones where due to a rol... 
[12:23:31] (03PS7) 10Clément Goubert: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 [12:23:40] (03CR) 10Jbond: Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [12:27:39] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetdb2003.codfw.wmnet with reason: host reimage [12:28:21] (03PS1) 10Btullis: Update the kubectl config files generated for the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/887994 (https://phabricator.wikimedia.org/T322635) [12:29:45] (03PS2) 10JMeybohm: k8s::package: Ensure the apt component is registered first [puppet] - 10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) [12:30:23] (03CR) 10JMeybohm: [V: 03+1] k8s::package: Ensure the apt component is registered first [puppet] - 10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [12:31:18] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:31:21] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39480/console" [puppet] - 10https://gerrit.wikimedia.org/r/887994 (https://phabricator.wikimedia.org/T322635) (owner: 10Btullis) [12:31:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1001.eqiad.wmnet with reason: host reimage [12:32:14] (03CR) 10Btullis: Update the kubectl config files generated for the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/887994 (https://phabricator.wikimedia.org/T322635) (owner: 10Btullis) [12:32:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
puppetdb2003.codfw.wmnet with reason: host reimage [12:33:43] (03CR) 10JMeybohm: "/cc Jesse - I think cfssl-issuer in aux is a copy-paste from dse? If so, it can/should be removed as well" [puppet] - 10https://gerrit.wikimedia.org/r/887994 (https://phabricator.wikimedia.org/T322635) (owner: 10Btullis) [12:33:48] (03PS3) 10Jbond: rotate-snmp: convert to cookbook classes and use secrets for passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 [12:33:54] (03CR) 10Jbond: [C: 03+2] rotate-snmp: convert to cookbook classes and use secrets for passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond) [12:34:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P43998 and previous config saved to /var/cache/conftool/dbconfig/20230209-123430-marostegui.json [12:34:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1001.eqiad.wmnet with reason: host reimage [12:35:15] (03CR) 10Jbond: [C: 03+2] rotate-snmp: convert to cookbook classes and use secrets for passwords (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond) [12:35:36] (03Merged) 10jenkins-bot: rotate-snmp: convert to cookbook classes and use secrets for passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/884996 (owner: 10Jbond) [12:35:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P43999 and previous config saved to /var/cache/conftool/dbconfig/20230209-123542-marostegui.json [12:37:17] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:38:44] (03CR) 10Ladsgroup: [C: 03+1] drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui) [12:39:17] (03CR) 10Marostegui: [C: 03+2] drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui) [12:39:19] (03PS1) 10Jbond: sre.pdus: correctly pass down doc string to arg parse method [cookbooks] - 10https://gerrit.wikimedia.org/r/887995 [12:39:43] (03Merged) 10jenkins-bot: drop_cuc_comment_T329260.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887967 (https://phabricator.wikimedia.org/T329260) (owner: 10Marostegui) [12:41:24] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/887995 (owner: 10Jbond) [12:42:17] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:42:18] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/886857 (https://phabricator.wikimedia.org/T329195) (owner: 10Cwhite) [12:46:13] (03CR) 10JMeybohm: [C: 03+1] Add sre.k8s.upgrade-cluster (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 
10Elukey) [12:46:14] 10SRE-tools, 10Discovery-Search, 10Elasticsearch, 10Infrastructure-Foundations, 10Spicerack: elasticsearch spicerack module failes with most recent elastic-curator - https://phabricator.wikimedia.org/T328775 (10Volans) @bking also keep in mind that for spicerack we use debian packages, so unless we do pa... [12:46:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetdb2003.codfw.wmnet with OS bullseye [12:46:55] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host puppetdb2003.codfw.wmnet with OS bullseye completed: - puppetd... [12:47:25] (03PS29) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [12:47:38] (03CR) 10Volans: [C: 04-1] "LGTM just one typo" [cookbooks] - 10https://gerrit.wikimedia.org/r/887995 (owner: 10Jbond) [12:47:46] (03CR) 10CI reject: [V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [12:48:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [12:48:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [12:48:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44000 and previous config saved to /var/cache/conftool/dbconfig/20230209-124837-ladsgroup.json [12:48:41] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [12:48:51] !log 
joal@deploy1002 Started deploy [airflow-dags/analytics@cf9d978]: Fix analytics pageview_actor_hourly [12:49:04] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@cf9d978]: Fix analytics pageview_actor_hourly (duration: 00m 13s) [12:49:26] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 (owner: 10Muehlenhoff) [12:49:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P44001 and previous config saved to /var/cache/conftool/dbconfig/20230209-124936-marostegui.json [12:50:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P44002 and previous config saved to /var/cache/conftool/dbconfig/20230209-125048-marostegui.json [12:50:58] (KubernetesCalicoDown) firing: dse-k8s-worker1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:51:10] (03CR) 10Btullis: fix(varnishkafka): add alert duration of 5m to avoid false positive (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison) [12:52:47] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1001.eqiad.wmnet with OS buster [12:56:40] (03CR) 10Jaime Nuche: [C: 04-1] jenkins: fix directory and restrict sudo rules to jenkins jars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [12:58:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44003 and previous config saved to 
/var/cache/conftool/dbconfig/20230209-125803-ladsgroup.json [12:58:07] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [13:02:00] (03CR) 10Btullis: Remove the GPU configuration from an-worker109[67] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) (owner: 10Btullis) [13:04:20] (03CR) 10Jbond: [C: 03+1] ci: move lists of contint and zuul hosts to hieradata/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [13:04:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T328817)', diff saved to https://phabricator.wikimedia.org/P44004 and previous config saved to /var/cache/conftool/dbconfig/20230209-130442-marostegui.json [13:04:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance [13:04:47] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:04:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance [13:05:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T328817)', diff saved to https://phabricator.wikimedia.org/P44005 and previous config saved to /var/cache/conftool/dbconfig/20230209-130504-marostegui.json [13:05:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T329203)', diff saved to https://phabricator.wikimedia.org/P44006 and previous config saved to /var/cache/conftool/dbconfig/20230209-130555-marostegui.json [13:05:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [13:05:58] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [13:06:10] 
!log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [13:08:38] (03CR) 10Jbond: [C: 03+1] "lgtm, ping me to merge" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [13:09:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T328817)', diff saved to https://phabricator.wikimedia.org/P44007 and previous config saved to /var/cache/conftool/dbconfig/20230209-130901-marostegui.json [13:09:39] (03PS1) 10Mazevedo: Add iOS stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887998 (https://phabricator.wikimedia.org/T328697) [13:09:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [13:10:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [13:10:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T329203)', diff saved to https://phabricator.wikimedia.org/P44008 and previous config saved to /var/cache/conftool/dbconfig/20230209-131010-marostegui.json [13:10:19] (03PS1) 10Muehlenhoff: cookbooks.sre.elasticsearch.restart-nginx: New cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/887999 [13:10:21] (03CR) 10CI reject: [V: 04-1] Add iOS stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887998 (https://phabricator.wikimedia.org/T328697) (owner: 10Mazevedo) [13:10:40] (03PS2) 10Nicolas Fraison: fix(varnishkafka): add alert duration of 5m to avoid false positive [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) [13:12:17] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:12:55] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison) [13:13:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P44009 and previous config saved to /var/cache/conftool/dbconfig/20230209-131309-ladsgroup.json [13:13:52] (03PS2) 10Mazevedo: Add iOS stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887998 (https://phabricator.wikimedia.org/T328697) [13:14:59] !log restarting Exim on MXes to pick up OpenSSL update [13:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T329203)', diff saved to https://phabricator.wikimedia.org/P44010 and previous config saved to /var/cache/conftool/dbconfig/20230209-131540-marostegui.json [13:15:44] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [13:16:25] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:44] (03CR) 10Nicolas Fraison: fix(varnishkafka): add alert duration of 5m to avoid false positive (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison) [13:19:59] (03CR) 10Jbond: [C: 03+2] phabricator: create phd home directory on service start [puppet] - 10https://gerrit.wikimedia.org/r/875266 
(https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [13:22:29] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:23:50] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10LSobanski) - GitLab failover requires a ~1.5h maintenance window during which GitLab will be unavailable. - We won'... [13:24:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P44011 and previous config saved to /var/cache/conftool/dbconfig/20230209-132407-marostegui.json [13:24:15] (03PS2) 10Jbond: sre.pdus: correctly pass down doc string to arg parse method [cookbooks] - 10https://gerrit.wikimedia.org/r/887995 [13:24:52] (03PS4) 10Muehlenhoff: sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 [13:25:02] (03CR) 10Jbond: "fixed thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/887995 (owner: 10Jbond) [13:25:09] (03CR) 10Volans: "question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/887999 (owner: 10Muehlenhoff) [13:26:53] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/887995 (owner: 10Jbond) [13:27:39] !log phab2002: manually stopped `phd` service. 
It can't start due to the MariaDB server being set read-only and failed to start every 10 seconds since forever [13:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P44012 and previous config saved to /var/cache/conftool/dbconfig/20230209-132815-ladsgroup.json [13:30:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P44013 and previous config saved to /var/cache/conftool/dbconfig/20230209-133046-marostegui.json [13:32:12] (03CR) 10Ayounsi: [C: 03+1] "LGTM! But hard to mentally parse it all." [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/887982 (owner: 10Volans) [13:33:56] (03CR) 10JMeybohm: Add a spark-operator chart and helmfile configuration (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [13:39:02] (03CR) 10Jaime Nuche: [C: 04-1] jenkins: fix directory and restrict sudo rules to jenkins jars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [13:39:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P44014 and previous config saved to /var/cache/conftool/dbconfig/20230209-133914-marostegui.json [13:39:30] (03CR) 10Ottomata: "Ah I see still a little WIP? Ping me when you'd like another review. Looking good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [13:40:15] !log restart prometheus-statsd-exporter on ores nodes to pick up label change - T325763 [13:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:18] T325763: Review ORES traffic to better understand Lift Wing's requirements - https://phabricator.wikimedia.org/T325763 [13:41:29] (03PS1) 10Slyngshede: SUL account linking [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) [13:43:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44016 and previous config saved to /var/cache/conftool/dbconfig/20230209-134322-ladsgroup.json [13:43:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:43:26] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [13:43:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:43:39] (03PS2) 10Slyngshede: SUL account linking [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) [13:43:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44017 and previous config saved to /var/cache/conftool/dbconfig/20230209-134343-ladsgroup.json [13:44:05] (03CR) 10Muehlenhoff: [C: 03+2] sre.hosts.reimage: Add proper error message if hostname is passed as FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/887976 (owner: 10Muehlenhoff) [13:45:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P44018 and previous config saved to 
/var/cache/conftool/dbconfig/20230209-134553-marostegui.json [13:47:11] (03PS3) 10Slyngshede: SUL account linking [software/bitu] - 10https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) [13:47:16] (03CR) 10Hashar: [C: 04-1] ci: move lists of contint and zuul hosts to hieradata/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [13:50:18] (03CR) 10Jbond: "lgtm see nits below and grab a +1 from moritz before we merge" [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [13:50:20] (03PS4) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 [13:50:31] (03PS2) 10Jbond: apt::repository: use signed-by instead of apt-key [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [13:51:57] (03PS3) 10Jbond: apt::repository: use signed-by instead of apt-key [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [13:53:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44019 and previous config saved to /var/cache/conftool/dbconfig/20230209-135309-ladsgroup.json [13:53:13] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [13:53:27] !log joal@deploy1002 Started deploy [airflow-dags/analytics@fbebd61]: Update analytics actor dags spark resources [13:53:41] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@fbebd61]: Update analytics actor dags spark resources (duration: 00m 13s) [13:53:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39482/console" [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [13:54:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T328817)', diff saved to https://phabricator.wikimedia.org/P44020 
and previous config saved to /var/cache/conftool/dbconfig/20230209-135420-marostegui.json [13:54:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:54:25] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:54:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:54:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2429.codfw.wmnet with OS buster [13:54:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44021 and previous config saved to /var/cache/conftool/dbconfig/20230209-135441-marostegui.json [13:54:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2429.codfw.wmnet with OS buster [13:55:22] (03CR) 10CI reject: [V: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi) [13:57:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44022 and previous config saved to /var/cache/conftool/dbconfig/20230209-135741-marostegui.json [13:58:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2430.codfw.wmnet with OS buster [13:58:13] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2430.codfw.wmnet with OS buster [13:58:41] (03CR) 10Filippo Giunchedi: "Idea LGTM, see inline re: 
warning/critical in title" [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison) [14:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1400) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1400). [14:00:04] No Gerrit patches in the queue for this window AFAICS. [14:00:32] I’ll try to test the hacky fix I proposed at https://phabricator.wikimedia.org/T328634#8593132 later in the window [14:01:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T329203)', diff saved to https://phabricator.wikimedia.org/P44023 and previous config saved to /var/cache/conftool/dbconfig/20230209-140059-marostegui.json [14:01:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [14:01:03] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [14:01:11] (03CR) 10Jbond: [V: 03+1] "AFAICT this only affects the following classes and nothing in production" [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [14:01:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [14:01:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [14:01:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [14:01:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T329203)',
diff saved to https://phabricator.wikimedia.org/P44024 and previous config saved to /var/cache/conftool/dbconfig/20230209-140124-marostegui.json [14:02:51] (03PS4) 10Majavah: apt::repository: use signed-by instead of apt-key [puppet] - 10https://gerrit.wikimedia.org/r/887943 [14:02:54] (03CR) 10Jbond: [C: 03+1] ci: move lists of contint and zuul hosts to hieradata/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [14:03:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10BTullis) [14:03:12] (03CR) 10CI reject: [V: 04-1] apt::repository: use signed-by instead of apt-key [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [14:03:27] (03CR) 10Herron: [C: 03+1] Upgrade plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/886861 (https://phabricator.wikimedia.org/T317887) (owner: 10Cwhite) [14:03:44] (03CR) 10Slyngshede: [C: 03+2] C:IDM Enable the group creating pipeline. 
[puppet] - 10https://gerrit.wikimedia.org/r/886331 (owner: 10Slyngshede) [14:03:51] (03PS5) 10Majavah: apt::repository: use signed-by instead of apt-key [puppet] - 10https://gerrit.wikimedia.org/r/887943 [14:04:33] (03CR) 10Majavah: apt::repository: use signed-by instead of apt-key (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [14:06:02] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39483/console" [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [14:06:31] 10SRE, 10Infrastructure-Foundations: Implement email address validation workflow - https://phabricator.wikimedia.org/T320808 (10SLyngshede-WMF) p:05Triage→03Low [14:06:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T329203)', diff saved to https://phabricator.wikimedia.org/P44025 and previous config saved to /var/cache/conftool/dbconfig/20230209-140650-marostegui.json [14:06:54] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [14:07:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:08:12] alright, I’ll do some testing on mwdebug1001, I hope nobody else is deploying there at the moment :) [14:08:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P44026 and previous config saved to /var/cache/conftool/dbconfig/20230209-140815-ladsgroup.json [14:08:36] hrm [14:09:02] !log lucaswerkmeister-wmde@mwdebug1001:~$ mwscript namespaceDupes.php shnwikibooks --fix | tee T328634-1-unpatched.out # T328634 – 
finished successfully, to my surprise [14:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:06] T328634: Lost pages after deployed addtional namespaces on shn.wikibooks - https://phabricator.wikimedia.org/T328634 [14:09:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:09:13] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:09:21] (03PS1) 10Jelto: prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points [puppet] - 10https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) [14:10:12] (03CR) 10Jbond: [C: 03+2] sre.pdus: correctly pass down doc string to arg parse method [cookbooks] - 10https://gerrit.wikimedia.org/r/887995 (owner: 10Jbond) [14:10:16] (03PS3) 10Jbond: sre.pdus: correctly pass down doc string to arg parse method [cookbooks] - 10https://gerrit.wikimedia.org/r/887995 [14:11:17] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39484/console" [puppet] - 10https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) (owner: 10Jelto) [14:12:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P44027 and previous config saved to /var/cache/conftool/dbconfig/20230209-141247-marostegui.json [14:13:30] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/887943 (owner: 10Majavah) [14:14:00] !log T329089: re-playing detected inconsistencies (missing mediawiki.page-undelete events) from 2022-10-31 to 2023-02-07 to WDQS [14:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:08] 
T329089: The rdf-streaming-updater does not reconcile missed page-undelete events - https://phabricator.wikimedia.org/T329089 [14:14:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2429.codfw.wmnet with reason: host reimage [14:14:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2431.codfw.wmnet with OS buster [14:14:48] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster [14:15:00] alright, I’m done with my testing (and didn’t even end up editing any files) [14:16:07] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm, let's see if it works :)" [puppet] - 10https://gerrit.wikimedia.org/r/887872 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:16:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10Jclark-ctr) [14:17:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2429.codfw.wmnet with reason: host reimage [14:17:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10Jclark-ctr) removed gpu from an-worker1098, an-worker1099. installed both gpu into dse-k8s-worker... 
[14:19:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2430.codfw.wmnet with reason: host reimage [14:21:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2432.codfw.wmnet with OS buster [14:21:45] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2432.codfw.wmnet with OS buster [14:21:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P44028 and previous config saved to /var/cache/conftool/dbconfig/20230209-142157-marostegui.json [14:22:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2430.codfw.wmnet with reason: host reimage [14:23:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P44029 and previous config saved to /var/cache/conftool/dbconfig/20230209-142321-ladsgroup.json [14:24:55] (03CR) 10Muehlenhoff: cookbooks.sre.elasticsearch.restart-nginx: New cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/887999 (owner: 10Muehlenhoff) [14:25:57] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@dc3cd56]: T329089: proper reconciliation of missed page-undelete events [14:25:58] (KubernetesCalicoDown) resolved: dse-k8s-worker1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:26:02] T329089: The rdf-streaming-updater does not reconcile missed page-undelete events - https://phabricator.wikimedia.org/T329089 [14:26:27] PROBLEM - 
mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:27:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:27:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10BTullis) 05Open→03Resolved Great! Thanks @Jclark-ctr both cards detected. ` btullis@dse-k8s-wo... [14:27:30] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2053.codfw.wmnet with OS bullseye [14:27:45] (03CR) 10Btullis: [C: 03+2] Update the kubectl config files generated for the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/887994 (https://phabricator.wikimedia.org/T322635) (owner: 10Btullis) [14:27:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P44030 and previous config saved to /var/cache/conftool/dbconfig/20230209-142754-marostegui.json [14:27:59] (03PS5) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 [14:29:57] (03CR) 10Andrew Bogott: [C: 03+2] puppet: adapt replica_cnf_api to python3.5 [puppet] - 10https://gerrit.wikimedia.org/r/887872 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:31:03] (03CR) 10FNegri: [V: 03+2 C: 03+2] Add support for cloud test env (codfw) (032 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [14:31:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.703 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:31:48] (03CR) 10Filippo Giunchedi: [C: 03+2] "Verification happened on meet" [puppet] - 
10https://gerrit.wikimedia.org/r/887937 (https://phabricator.wikimedia.org/T328787) (owner: 10Filippo Giunchedi) [14:31:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:31:57] (03PS2) 10Btullis: Remove the GPU configuration from an-worker109[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) [14:32:12] (03CR) 10Volans: [C: 04-1] "Nice! It's ready to be tested, just one inverted check to fix." [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:32:26] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:32:58] (03CR) 10CI reject: [V: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (owner: 10Ayounsi) [14:33:43] (03CR) 10Jbond: "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney) [14:34:15] (03PS28) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [14:34:22] (03PS4) 10Raymond Ndibe: puppet: modify role::wmcs::nfs::primary for replica_cnf api [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) [14:34:55] (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:35:15] (03CR) 10Volans: [C: 03+1] "NICE!!! Let's start testing it!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:35:45] (03PS2) 10EoghanGaffney: Try running docker before the base firewall rules are added [puppet] - 10https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035) [14:36:08] (03CR) 10EoghanGaffney: Try running docker before the base firewall rules are added (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney) [14:37:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P44031 and previous config saved to /var/cache/conftool/dbconfig/20230209-143704-marostegui.json [14:37:07] (03CR) 10Volans: Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [14:37:52] (03PS6) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) [14:38:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44032 and previous config saved to /var/cache/conftool/dbconfig/20230209-143828-ladsgroup.json [14:38:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:38:32] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [14:38:41] 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10SLyngshede-WMF) [14:38:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:38:53] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate 
netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:39:01] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Initial IDM puppetisation - https://phabricator.wikimedia.org/T320428 (10SLyngshede-WMF) 05In progress→03Resolved Initial work is done, but is to come down the line. [14:39:21] (03CR) 10Volans: cookbooks.sre.elasticsearch.restart-nginx: New cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/887999 (owner: 10Muehlenhoff) [14:39:24] 10SRE, 10Infrastructure-Foundations: Create an IDM for Wikimedia developer accounts - https://phabricator.wikimedia.org/T319405 (10SLyngshede-WMF) [14:39:37] 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF All sub-tasks are now closed. [14:40:02] (03PS1) 10Mforns: analytics::refinery::job::druid_load.pp: Absent 3 jobs to migrate [puppet] - 10https://gerrit.wikimedia.org/r/888018 (https://phabricator.wikimedia.org/T328933) [14:40:15] 10SRE, 10Data-Persistence, 10Discovery-Search, 10serviceops, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Clement_Goubert) >>! In T327920#8570661, @bd808 wrote: > #Toolhub does not have a working Kubernetes deployment outside of eqiad ({T28... [14:41:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2432.codfw.wmnet with reason: host reimage [14:42:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you for the context in the commit message!" 
[puppet] - 10https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) (owner: 10Jelto) [14:43:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44033 and previous config saved to /var/cache/conftool/dbconfig/20230209-144300-marostegui.json [14:43:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [14:43:04] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [14:43:15] (03CR) 10CI reject: [V: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [14:43:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [14:43:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T328817)', diff saved to https://phabricator.wikimedia.org/P44034 and previous config saved to /var/cache/conftool/dbconfig/20230209-144321-marostegui.json [14:43:39] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2053.codfw.wmnet with reason: host reimage [14:44:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2432.codfw.wmnet with reason: host reimage [14:44:39] (03CR) 10Andrew Bogott: [C: 03+2] puppet: modify role::wmcs::nfs::primary for replica_cnf api [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [14:44:44] !log jiji@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:44:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:44:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2429.codfw.wmnet with OS buster [14:44:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2429.codfw.wmnet with OS buster completed: - mw2429 (**PASS**) - Removed from Pupp... [14:44:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:44:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2430.codfw.wmnet with OS buster [14:44:59] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:45:04] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2430.codfw.wmnet with OS buster completed: - mw2430 (**PASS**) - Removed from Pupp... 
[14:45:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [14:45:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [14:45:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44035 and previous config saved to /var/cache/conftool/dbconfig/20230209-144535-ladsgroup.json [14:45:39] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [14:46:18] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for SSH Access for kofori - https://phabricator.wikimedia.org/T328787 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi @KOfori change is live and you should have full access in ~20 min. The bastions will be accessible already. See also https://wi... [14:46:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2433.codfw.wmnet with OS buster [14:46:45] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@dc3cd56]: T329089: proper reconciliation of missed page-undelete events (duration: 20m 48s) [14:46:47] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2433.codfw.wmnet with OS buster [14:46:48] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2053.codfw.wmnet with reason: host reimage [14:46:48] T329089: The rdf-streaming-updater does not reconcile missed page-undelete events - https://phabricator.wikimedia.org/T329089 [14:46:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2434.codfw.wmnet with OS buster [14:47:04] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - 
https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2434.codfw.wmnet with OS buster [14:48:55] (03CR) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [14:49:15] (03PS7) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) [14:49:56] !log jiji@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1001.eqiad.wmnet'] [14:50:51] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp1001.eqiad.wmnet'] [14:51:44] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1001.eqiad.wmnet'] [14:51:49] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp1001.eqiad.wmnet'] [14:52:00] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:52:03] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:52:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T329203)', diff saved to https://phabricator.wikimedia.org/P44036 and previous config saved to /var/cache/conftool/dbconfig/20230209-145210-marostegui.json [14:52:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:52:14] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [14:52:26] !log marostegui@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:52:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44037 and previous config saved to /var/cache/conftool/dbconfig/20230209-145232-marostegui.json [14:52:36] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:52:39] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:54:30] (03CR) 10CI reject: [V: 04-1] [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [14:55:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44038 and previous config saved to /var/cache/conftool/dbconfig/20230209-145513-ladsgroup.json [14:55:17] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [14:55:42] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:56:11] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:56:33] !log jiji@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:57:01] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:57:22] !log jiji@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:58:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T329203)', diff saved to 
https://phabricator.wikimedia.org/P44039 and previous config saved to /var/cache/conftool/dbconfig/20230209-145811-marostegui.json [14:58:15] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [14:58:34] (03CR) 10Andrew Bogott: [C: 03+2] "we got impatient :)" [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [14:58:40] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1001.eqiad.wmnet [14:59:14] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:02:53] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Clement_Goubert) While not directly linked to the switchover as it does not have a codfw deployment, Toolhub will p... 
[15:03:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2053.codfw.wmnet with OS bullseye [15:03:42] (03PS1) 10Filippo Giunchedi: alertmanager: restore alert history feature on alerts.w.o [puppet] - 10https://gerrit.wikimedia.org/r/888027 (https://phabricator.wikimedia.org/T329294) [15:04:44] (03CR) 10Jbond: Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [15:04:55] (03CR) 10Btullis: analytics::refinery::job::druid_load.pp: Absent 3 jobs to migrate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888018 (https://phabricator.wikimedia.org/T328933) (owner: 10Mforns) [15:04:59] !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet [15:05:25] (03CR) 10Ayounsi: "Looking more into is, the Calico configuration knob `keepOriginalNextHop` implemented in https://github.com/projectcalico/libcalico-go/pul" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886321 (https://phabricator.wikimedia.org/T328523) (owner: 10Alexandros Kosiaris) [15:06:18] (03CR) 10Herron: [C: 03+1] "LGTM for v0" [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [15:06:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2434.codfw.wmnet with reason: host reimage [15:07:04] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/887982 (owner: 10Volans) [15:07:17] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2054.codfw.wmnet with OS bullseye [15:08:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:08:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
mw2432.codfw.wmnet with OS buster [15:08:36] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2432.codfw.wmnet with OS buster completed: - mw2432 (**PASS**) - Removed from Pupp... [15:09:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2435.codfw.wmnet with OS buster [15:09:10] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2435.codfw.wmnet with OS buster [15:09:16] (03CR) 10Jbond: [C: 03+1] "lgtm lets give it shot" [puppet] - 10https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney) [15:09:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2434.codfw.wmnet with reason: host reimage [15:09:35] (03CR) 10Herron: [C: 03+1] alertmanager: restore alert history feature on alerts.w.o [puppet] - 10https://gerrit.wikimedia.org/r/888027 (https://phabricator.wikimedia.org/T329294) (owner: 10Filippo Giunchedi) [15:10:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P44040 and previous config saved to /var/cache/conftool/dbconfig/20230209-151019-ladsgroup.json [15:10:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2431.codfw.wmnet with OS buster [15:11:11] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster executed with errors: - mw2431 (**FAIL**) - Remove... 
[15:11:47] (03CR) 10Herron: [C: 03+2] "cheers thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/887804 (owner: 10Herron) [15:12:14] !log jiji@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mc-gp1001.eqiad.wmnet [15:13:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P44041 and previous config saved to /var/cache/conftool/dbconfig/20230209-151317-marostegui.json [15:16:15] !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1001.eqiad.wmnet [15:17:29] PROBLEM - Check systemd state on graphite2004 is CRITICAL: CRITICAL - degraded: The following units failed: statsd-proxy-socat-6to4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:32] (03CR) 10Nicolas Fraison: fix(varnishkafka): add alert duration of 5m to avoid false positive (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison) [15:22:12] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Santiago Faci - https://phabricator.wikimedia.org/T329296 (10Sfaci) [15:22:18] (03PS1) 10Hnowlan: Pin setuptools and packaging versions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/888034 (https://phabricator.wikimedia.org/T329290) [15:22:24] 10Puppet, 10Infrastructure-Foundations: pupetmastrs: investigate if the puppetmasteres still need a checkout of operations/software - https://phabricator.wikimedia.org/T329297 (10jbond) p:05Triage→03Medium [15:23:05] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:23:28] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2054.codfw.wmnet with reason: host reimage [15:23:32] (03CR) 10CI reject: 
[V: 04-1] Pin setuptools and packaging versions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/888034 (https://phabricator.wikimedia.org/T329290) (owner: 10Hnowlan) [15:23:34] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: restore alert history feature on alerts.w.o [puppet] - 10https://gerrit.wikimedia.org/r/888027 (https://phabricator.wikimedia.org/T329294) (owner: 10Filippo Giunchedi) [15:24:20] !log jiji@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-gp1001.eqiad.wmnet [15:25:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P44042 and previous config saved to /var/cache/conftool/dbconfig/20230209-152525-ladsgroup.json [15:25:54] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2054.codfw.wmnet with reason: host reimage [15:28:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P44043 and previous config saved to /var/cache/conftool/dbconfig/20230209-152824-marostegui.json [15:31:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison) [15:31:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:31:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2434.codfw.wmnet with OS buster [15:31:49] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2434.codfw.wmnet with OS buster completed: - mw2434 (**PASS**) - Removed from Pupp... 
[15:34:18] (03PS2) 10Mforns: analytics::refinery::job::druid_load.pp: Absent 3 jobs to migrate [puppet] - 10https://gerrit.wikimedia.org/r/888018 (https://phabricator.wikimedia.org/T328933) [15:34:34] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1001.eqiad.wmnet [15:34:34] !log jiji@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts mc-gp1001.eqiad.wmnet [15:34:35] 10SRE, 10Wikimedia-Mailing-lists, 10User-MarcoAurelio: MM3/Postorius: Inconsistent translations for "Log In" in Spanish - https://phabricator.wikimedia.org/T312204 (10MarcoAurelio) 05Open→03Resolved a:03MarcoAurelio This was fixed in [[ https://gitlab.com/mailman/django-mailman3/-/commit/31c6ae825fa055... [15:38:25] 10SRE, 10Wikimedia-Mailing-lists: MM3/Postorius: Inconsistent translations for "Log In" in Spanish - https://phabricator.wikimedia.org/T312204 (10MarcoAurelio) a:05MarcoAurelio→03None [15:39:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2431.codfw.wmnet with OS buster [15:39:35] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1001.eqiad.wmnet with OS bullseye [15:39:38] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster [15:39:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2431.codfw.wmnet with OS buster [15:39:49] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster executed with errors: - mw2431 (**FAIL**) - Remove... 
[15:40:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44044 and previous config saved to /var/cache/conftool/dbconfig/20230209-154032-ladsgroup.json [15:40:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [15:40:35] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [15:40:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [15:40:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [15:40:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [15:40:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44045 and previous config saved to /var/cache/conftool/dbconfig/20230209-154058-ladsgroup.json [15:41:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2431.codfw.wmnet with OS buster [15:41:11] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster [15:41:52] (03PS1) 10Vgutierrez: varnish: Perform ESI processing on wiki pages [puppet] - 10https://gerrit.wikimedia.org/r/888044 (https://phabricator.wikimedia.org/T308799) [15:41:59] (03PS3) 10JHathaway: Add jaeger-es-index-cleaner [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) [15:42:10] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2054.codfw.wmnet 
with OS bullseye [15:42:49] (03CR) 10EoghanGaffney: [C: 03+2] Try running docker before the base firewall rules are added [puppet] - 10https://gerrit.wikimedia.org/r/887983 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney) [15:42:58] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2433.codfw.wmnet with OS buster [15:43:02] (03CR) 10JHathaway: Add jaeger-es-index-cleaner (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [15:43:06] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2433.codfw.wmnet with OS buster executed with errors: - mw2433 (**FAIL**) - Remove... [15:43:13] 10SRE, 10Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (10MarcoAurelio) Pardon my ignorance but are partial i18n updates possible (e.g. [[ https://gitlab.com/mailman/django-mailman3/-/tree/master/django_mailman3/lo... 
[15:43:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44046 and previous config saved to /var/cache/conftool/dbconfig/20230209-154330-marostegui.json [15:43:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:43:34] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [15:43:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:43:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T328817)', diff saved to https://phabricator.wikimedia.org/P44047 and previous config saved to /var/cache/conftool/dbconfig/20230209-154337-marostegui.json [15:43:41] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [15:43:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44048 and previous config saved to /var/cache/conftool/dbconfig/20230209-154347-marostegui.json [15:46:23] (03PS1) 10Filippo Giunchedi: admin: add Santiago Faci [puppet] - 10https://gerrit.wikimedia.org/r/888045 (https://phabricator.wikimedia.org/T329296) [15:49:08] (03PS1) 10Herron: statsd_proxy: add ipv6only=1 to socat relay config [puppet] - 10https://gerrit.wikimedia.org/r/888046 [15:49:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44049 and previous config saved to /var/cache/conftool/dbconfig/20230209-154919-marostegui.json [15:49:23] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - 
https://phabricator.wikimedia.org/T329203 [15:50:05] (03CR) 10Herron: [C: 03+2] statsd_proxy: add ipv6only=1 to socat relay config [puppet] - 10https://gerrit.wikimedia.org/r/888046 (owner: 10Herron) [15:50:10] (03CR) 10Stevemunene: [C: 03+2] analytics::refinery::job::druid_load.pp: Absent 3 jobs to migrate [puppet] - 10https://gerrit.wikimedia.org/r/888018 (https://phabricator.wikimedia.org/T328933) (owner: 10Mforns) [15:50:17] (03CR) 10Jelto: [V: 03+1] prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) (owner: 10Jelto) [15:50:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44050 and previous config saved to /var/cache/conftool/dbconfig/20230209-155019-ladsgroup.json [15:50:23] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [15:51:57] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Ingest php-slowlog in logstash - https://phabricator.wikimedia.org/T326794 (10Clement_Goubert) Dashboard available: https://logstash.wikimedia.org/app/dashboards#/view/74557260-a88f-11ed-96bb-4b4732aa077a [15:52:10] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1001.eqiad.wmnet with reason: host reimage [15:52:21] (03CR) 10Stevemunene: [C: 03+2] analytics::refinery::job::druid_load.pp: Absent 3 jobs to migrate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888018 (https://phabricator.wikimedia.org/T328933) (owner: 10Mforns) [15:53:03] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888009 (https://phabricator.wikimedia.org/T328972) (owner: 10Jelto) [15:54:39] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 
on mc-gp1001.eqiad.wmnet with reason: host reimage [15:54:59] (03PS8) 10Ayounsi: [WIP] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) [15:55:17] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2055.codfw.wmnet with OS bullseye [15:55:24] !log restart esitest.service on A:cp-text [15:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:40] !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1002.eqiad.wmnet [15:55:47] !log jiji@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp1002.eqiad.wmnet [15:56:02] !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp1002.eqiad.wmnet [15:56:03] (03CR) 10Ayounsi: "As we're adding 1.16 vs. 1.23 conditionals, if we merge that, we need to add the relevant cleanups to https://phabricator.wikimedia.org/T3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [15:56:35] !log jiji@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts mc-gp1002.eqiad.wmnet [15:58:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P44051 and previous config saved to /var/cache/conftool/dbconfig/20230209-155843-marostegui.json [16:02:07] (03CR) 10Clément Goubert: [C: 03+1] "Builds fine for me" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [16:02:40] !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab-runner1004.eqiad.wmnet with OS bullseye [16:04:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2171:3316', diff saved to https://phabricator.wikimedia.org/P44052 and previous config saved to /var/cache/conftool/dbconfig/20230209-160425-marostegui.json [16:05:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2435.codfw.wmnet with OS buster [16:05:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P44053 and previous config saved to /var/cache/conftool/dbconfig/20230209-160525-ladsgroup.json [16:05:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2435.codfw.wmnet with OS buster executed with errors: - mw2435 (**FAIL**) - Remove... [16:06:13] (03CR) 10Filippo Giunchedi: [C: 03+1] "Can confirm! Builds for me" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [16:07:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/888045 (https://phabricator.wikimedia.org/T329296) (owner: 10Filippo Giunchedi) [16:08:26] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add Santiago Faci [puppet] - 10https://gerrit.wikimedia.org/r/888045 (https://phabricator.wikimedia.org/T329296) (owner: 10Filippo Giunchedi) [16:08:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:09:21] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc2055.codfw.wmnet with OS bullseye [16:09:41] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2055.codfw.wmnet with OS bullseye [16:10:24] 10SRE, 
10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Santiago Faci - https://phabricator.wikimedia.org/T329296 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi @Sfaci you are now in the `wmf` LDAP group. I'm optimistically resolving the task, though feel free to reopen if some... [16:10:30] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1001.eqiad.wmnet with OS bullseye [16:13:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P44054 and previous config saved to /var/cache/conftool/dbconfig/20230209-161349-marostegui.json [16:13:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:14:29] RECOVERY - Check systemd state on graphite2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:31] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage [16:17:01] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage [16:18:17] (03PS2) 10Vgutierrez: varnish: Perform ESI processing on wiki pages [puppet] - 10https://gerrit.wikimedia.org/r/888044 (https://phabricator.wikimedia.org/T308799) [16:19:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P44055 and previous config saved to /var/cache/conftool/dbconfig/20230209-161931-marostegui.json [16:20:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to 
https://phabricator.wikimedia.org/P44056 and previous config saved to /var/cache/conftool/dbconfig/20230209-162032-ladsgroup.json [16:20:55] (03CR) 10BBlack: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/888044 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [16:25:32] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2055.codfw.wmnet with reason: host reimage [16:25:51] (03CR) 10Volans: [V: 03+2 C: 03+2] Add Makefile.deploy for the deploy cookbook [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/887982 (owner: 10Volans) [16:27:18] (03CR) 10Elukey: [C: 03+2] Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [16:27:25] (03PS29) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [16:28:16] (03PS1) 10Ahmon Dancy: logspam.pl: Filter out some persistent noise [puppet] - 10https://gerrit.wikimedia.org/r/888050 (https://phabricator.wikimedia.org/T323254) [16:28:23] (03CR) 10Hashar: jenkins: fix directory and restrict sudo rules to jenkins jars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [16:28:34] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2055.codfw.wmnet with reason: host reimage [16:28:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T328817)', diff saved to https://phabricator.wikimedia.org/P44057 and previous config saved to /var/cache/conftool/dbconfig/20230209-162855-marostegui.json [16:28:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:28:59] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 
[16:29:02] (03CR) 10Ahmon Dancy: "Tested on mwlog1002" [puppet] - 10https://gerrit.wikimedia.org/r/888050 (https://phabricator.wikimedia.org/T323254) (owner: 10Ahmon Dancy) [16:29:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:29:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44058 and previous config saved to /var/cache/conftool/dbconfig/20230209-162927-marostegui.json [16:31:16] (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - 10https://gerrit.wikimedia.org/r/888051 [16:31:18] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:31:30] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39486/console" [puppet] - 10https://gerrit.wikimedia.org/r/888044 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [16:32:57] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - 10https://gerrit.wikimedia.org/r/888051 (owner: 10Jbond) [16:33:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44059 and previous config saved to /var/cache/conftool/dbconfig/20230209-163327-marostegui.json [16:34:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44060 and previous config saved to /var/cache/conftool/dbconfig/20230209-163438-marostegui.json [16:34:40] !log marostegui@cumin1001 
START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:34:42] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [16:34:46] (03PS3) 10Andrea Denisse: centrallog: Add centrallog1001 to quickdatacopy allow hosts [puppet] - 10https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) [16:34:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:34:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T329203)', diff saved to https://phabricator.wikimedia.org/P44061 and previous config saved to /var/cache/conftool/dbconfig/20230209-163459-marostegui.json [16:35:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44062 and previous config saved to /var/cache/conftool/dbconfig/20230209-163538-ladsgroup.json [16:35:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [16:35:42] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [16:35:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [16:36:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44063 and previous config saved to /var/cache/conftool/dbconfig/20230209-163559-ladsgroup.json [16:36:05] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39487/console" [puppet] - 10https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [16:36:33] (03CR) 
10Brennen Bearnes: [C: 03+1] logspam.pl: Filter out some persistent noise [puppet] - 10https://gerrit.wikimedia.org/r/888050 (https://phabricator.wikimedia.org/T323254) (owner: 10Ahmon Dancy) [16:36:34] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@caf4808]: T329089: proper reconciliation of missed page-undelete events [16:36:37] T329089: The rdf-streaming-updater does not reconcile missed page-undelete events - https://phabricator.wikimedia.org/T329089 [16:36:47] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/886861 (https://phabricator.wikimedia.org/T317887) (owner: 10Cwhite) [16:37:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T329203)', diff saved to https://phabricator.wikimedia.org/P44064 and previous config saved to /var/cache/conftool/dbconfig/20230209-163720-marostegui.json [16:37:22] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2431.codfw.wmnet with OS buster [16:37:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster executed with errors: - mw2431 (**FAIL**) - Remove... 
[16:38:09] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] varnish: Perform ESI processing on wiki pages [puppet] - 10https://gerrit.wikimedia.org/r/888044 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [16:38:11] (03PS4) 10Andrea Denisse: centrallog: Enable auto_ferm_ipv6 to quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) [16:38:58] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@caf4808]: T329089: proper reconciliation of missed page-undelete events (duration: 02m 24s) [16:39:00] (03PS1) 10Muehlenhoff: Add safe.directory directives for the puppet master [puppet] - 10https://gerrit.wikimedia.org/r/888053 [16:39:30] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39488/console" [puppet] - 10https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [16:40:53] (03CR) 10Elukey: services: add the first lift wing stream to change-prop (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [16:41:40] (03CR) 10Andrea Denisse: "Hi, I enabled auto_ferm_ipv6 to open the firewall ports and sync the instances." [puppet] - 10https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [16:43:08] (03PS2) 10Muehlenhoff: cookbooks.sre.elasticsearch.restart-nginx: New cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/887999 [16:44:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr can I get an update on the situation here / estimate of when we might be able to add the 4 links detailed above? Ping... 
[16:44:27] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2055.codfw.wmnet with OS bullseye [16:44:52] !log installing curl security updates on buster [16:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/888053 (owner: 10Muehlenhoff) [16:45:10] (03CR) 10CI reject: [V: 04-1] cookbooks.sre.elasticsearch.restart-nginx: New cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/887999 (owner: 10Muehlenhoff) [16:45:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44065 and previous config saved to /var/cache/conftool/dbconfig/20230209-164525-ladsgroup.json [16:45:29] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [16:46:10] (03PS5) 10Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) [16:48:29] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1004.eqiad.wmnet with OS bullseye [16:48:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P44066 and previous config saved to /var/cache/conftool/dbconfig/20230209-164834-marostegui.json [16:51:16] (03PS3) 10Muehlenhoff: cookbooks.sre.elasticsearch.restart-nginx: New cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/887999 [16:52:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P44067 and previous config saved to /var/cache/conftool/dbconfig/20230209-165226-marostegui.json [16:56:18] (CertAlmostExpired) resolved: (2) Certificate for service 
wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:58:08] RECOVERY - Disk space on an-airflow1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-airflow1001&var-datasource=eqiad+prometheus/ops [17:00:04] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:00:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P44068 and previous config saved to /var/cache/conftool/dbconfig/20230209-170031-ladsgroup.json [17:01:50] (03CR) 10Jbond: "lgtm but may make more sense to go in puppetmaster::gitclone (which itself should probably be a profile but that's a different matter)" [puppet] - 10https://gerrit.wikimedia.org/r/888053 (owner: 10Muehlenhoff) [17:02:24] 10SRE, 10ops-eqiad, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q2): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (10wiki_willy) a:03Jclark-ctr [17:03:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P44069 and previous config saved to /var/cache/conftool/dbconfig/20230209-170340-marostegui.json [17:06:11] 10ops-eqiad, 10DC-Ops: hw troubleshooting: for - https://phabricator.wikimedia.org/T329305 (10wiki_willy) [17:06:38] (03PS1) 10EoghanGaffney: Insert an empty DOCKER-ISOLATION-STAGE-1 chain into the ferm templates [puppet] - 10https://gerrit.wikimedia.org/r/888057 (https://phabricator.wikimedia.org/T329035) [17:06:51] 10ops-eqiad, 
10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10wiki_willy) [17:07:19] !log rolling restart of FPM/Apache on mw canaries to pick up curl security updates [17:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P44070 and previous config saved to /var/cache/conftool/dbconfig/20230209-170732-marostegui.json [17:08:46] 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10wiki_willy) [17:12:22] 10SRE: add Hal Triedman (htriedman) to ops-l mailing list - https://phabricator.wikimedia.org/T329209 (10Htriedman) @fgiunchedi I just signed up via lists.wikimedia.org! Thanks for getting back to me. [17:14:08] (03PS2) 10Muehlenhoff: Add safe.directory directives for the puppet master [puppet] - 10https://gerrit.wikimedia.org/r/888053 [17:14:55] (03CR) 10Jelto: [C: 03+1] "lgtm to add an empty chain." 
[puppet] - 10https://gerrit.wikimedia.org/r/888057 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney) [17:15:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P44071 and previous config saved to /var/cache/conftool/dbconfig/20230209-171539-ladsgroup.json [17:17:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/888053 (owner: 10Muehlenhoff) [17:18:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T328817)', diff saved to https://phabricator.wikimedia.org/P44072 and previous config saved to /var/cache/conftool/dbconfig/20230209-171846-marostegui.json [17:18:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [17:18:50] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [17:19:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [17:20:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [17:21:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [17:21:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:21:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:21:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T328817)', diff saved to https://phabricator.wikimedia.org/P44073 and previous config saved to 
/var/cache/conftool/dbconfig/20230209-172129-marostegui.json [17:21:36] (03CR) 10Hnowlan: [C: 03+1] services: add the first lift wing stream to change-prop (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [17:22:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T329203)', diff saved to https://phabricator.wikimedia.org/P44074 and previous config saved to /var/cache/conftool/dbconfig/20230209-172239-marostegui.json [17:22:43] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [17:25:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T328817)', diff saved to https://phabricator.wikimedia.org/P44075 and previous config saved to /var/cache/conftool/dbconfig/20230209-172524-marostegui.json [17:25:28] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [17:30:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44076 and previous config saved to /var/cache/conftool/dbconfig/20230209-173045-ladsgroup.json [17:30:48] (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-02-06-121917-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/888061 [17:30:49] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [17:31:36] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [17:32:07] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@e84e692]: (no justification provided) [17:32:24] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@e84e692]: (no justification provided) (duration: 00m 16s) [17:33:53] 
(03Abandoned) 10Andrea Denisse: centrallog: Enable auto_ferm_ipv6 to quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/887812 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [17:40:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P44077 and previous config saved to /var/cache/conftool/dbconfig/20230209-174030-marostegui.json [17:41:05] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-02-06-121917-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/888061 (owner: 10BryanDavis) [17:41:06] !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp2001.codfw.wmnet [17:41:44] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T328733 (10AKhatun_WMF) Thank you, accessed! [17:43:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2431.codfw.wmnet with OS buster [17:44:05] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster [17:46:05] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-02-06-121917-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/888061 (owner: 10BryanDavis) [17:49:09] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T328420 (10wiki_willy) a:03Papaul [17:50:41] !log jiji@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mc-gp2001.codfw.wmnet [17:51:08] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 
(10wiki_willy) a:03Papaul [17:51:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2433.codfw.wmnet with OS buster [17:51:31] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2433.codfw.wmnet with OS buster [17:55:04] (03PS1) 10Herron: rsync: remove rsync::server::wrap_with_stunnel [puppet] - 10https://gerrit.wikimedia.org/r/888065 [17:55:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P44078 and previous config saved to /var/cache/conftool/dbconfig/20230209-175536-marostegui.json [17:57:11] (03PS1) 10MusikAnimal: InitialiseSettings: install PageAssessments on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888066 (https://phabricator.wikimedia.org/T328224) [18:00:05] bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1800). 
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1800) [18:00:48] !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp2001.codfw.wmnet [18:00:52] !log jiji@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mc-gp2001.codfw.wmnet [18:01:04] !log jiji@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mc-gp2001.codfw.wmnet [18:01:41] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:02:05] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:02:05] !log jiji@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet [18:02:20] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:03:05] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:03:12] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:03:55] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:04:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2431.codfw.wmnet with reason: host reimage [18:05:20] (03CR) 10Herron: "Came across these inline lookups while prepping rsync transfers between centrallog hosts. 
Proposing we simply get rid of them since permi" [puppet] - 10https://gerrit.wikimedia.org/r/888065 (owner: 10Herron) [18:07:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2431.codfw.wmnet with reason: host reimage [18:08:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2435.codfw.wmnet with OS buster [18:08:59] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2435.codfw.wmnet with OS buster [18:09:15] !log jiji@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [18:09:36] !log jiji@cumin1001 Updating IPMI password on 1 hosts - jiji@cumin1001 [18:09:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [18:10:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T328817)', diff saved to https://phabricator.wikimedia.org/P44079 and previous config saved to /var/cache/conftool/dbconfig/20230209-181043-marostegui.json [18:10:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [18:10:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2433.codfw.wmnet with reason: host reimage [18:10:47] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [18:11:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [18:11:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T328817)', diff saved to https://phabricator.wikimedia.org/P44080 and previous config saved to /var/cache/conftool/dbconfig/20230209-181115-marostegui.json [18:11:59] !log jiji@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet [18:12:00] !log jiji@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts mc-gp2001.codfw.wmnet [18:13:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T328817)', diff saved to https://phabricator.wikimedia.org/P44081 and previous config saved to /var/cache/conftool/dbconfig/20230209-181353-marostegui.json [18:13:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2433.codfw.wmnet with reason: host reimage [18:20:59] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:22:35] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp2001.codfw.wmnet with OS bullseye [18:26:17] (03CR) 10FNegri: [V: 03+2 C: 03+2] Add support for cloud test env (codfw) (032 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [18:28:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2435.codfw.wmnet with reason: host reimage [18:28:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P44082 and previous config saved to /var/cache/conftool/dbconfig/20230209-182859-marostegui.json [18:30:12] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:32:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2435.codfw.wmnet with reason: host reimage [18:32:54] 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (10jijiki) [18:32:58] !log pt1979@cumin2002 END 
(FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:32:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2433.codfw.wmnet with OS buster [18:33:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:33:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2431.codfw.wmnet with OS buster [18:33:05] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2433.codfw.wmnet with OS buster completed: - mw2433 (**PASS**) - Removed from Pupp... [18:33:09] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2431.codfw.wmnet with OS buster completed: - mw2431 (**PASS**) - Removed from Pupp... 
[18:34:46] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [18:36:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [18:36:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [18:36:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44083 and previous config saved to /var/cache/conftool/dbconfig/20230209-183611-ladsgroup.json [18:36:15] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [18:38:03] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp2001.codfw.wmnet with reason: host reimage [18:40:37] (03CR) 10Jbond: "thanks for the follow ups 😊" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [18:41:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp2001.codfw.wmnet with reason: host reimage [18:42:58] 10SRE, 10LDAP-Access-Requests: Add Jon Amar WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T329324 (10jon_amar-WMDE) [18:44:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P44084 and previous config saved to /var/cache/conftool/dbconfig/20230209-184405-marostegui.json [18:44:21] (03CR) 10FNegri: [V: 03+2 C: 03+2] Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [18:45:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44085 and previous config saved to 
/var/cache/conftool/dbconfig/20230209-184538-ladsgroup.json [18:45:42] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [18:48:15] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:49:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:49:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2435.codfw.wmnet with OS buster [18:49:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2435.codfw.wmnet with OS buster completed: - mw2435 (**PASS**) - Removed from Pupp... [18:55:58] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp2001.codfw.wmnet with OS bullseye [18:56:05] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [18:56:55] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (10Reedy) [18:59:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T328817)', diff saved to https://phabricator.wikimedia.org/P44086 and previous config saved to /var/cache/conftool/dbconfig/20230209-185912-marostegui.json [18:59:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [18:59:16] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [18:59:27] !log marostegui@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [18:59:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T328817)', diff saved to https://phabricator.wikimedia.org/P44087 and previous config saved to /var/cache/conftool/dbconfig/20230209-185933-marostegui.json [19:00:04] ^demon and dancy: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T1900). [19:00:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P44088 and previous config saved to /var/cache/conftool/dbconfig/20230209-190044-ladsgroup.json [19:01:12] !log start full-cluster in-place reindexing of all wiki elasticsearch clusters T147505 [19:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:15] T147505: [tracking] CirrusSearch: what is updated during re-indexing - https://phabricator.wikimedia.org/T147505 [19:02:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T328817)', diff saved to https://phabricator.wikimedia.org/P44089 and previous config saved to /var/cache/conftool/dbconfig/20230209-190211-marostegui.json [19:15:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P44090 and previous config saved to /var/cache/conftool/dbconfig/20230209-191551-ladsgroup.json [19:17:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P44091 and previous config saved to /var/cache/conftool/dbconfig/20230209-191717-marostegui.json [19:18:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - 
https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) @cmooney sorry for delay finished connecting links and updated cableid's [19:30:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T328255)', diff saved to https://phabricator.wikimedia.org/P44092 and previous config saved to /var/cache/conftool/dbconfig/20230209-193057-ladsgroup.json [19:30:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [19:31:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [19:31:02] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [19:31:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44093 and previous config saved to /var/cache/conftool/dbconfig/20230209-193107-ladsgroup.json [19:32:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P44094 and previous config saved to /var/cache/conftool/dbconfig/20230209-193223-marostegui.json [19:36:16] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Jhancock.wm) [19:40:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44095 and previous config saved to /var/cache/conftool/dbconfig/20230209-194032-ladsgroup.json [19:40:37] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [19:41:31] (03PS4) 10Bking: elastic: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) [19:42:30] (03CR) 10Bking: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [19:46:48] (03PS5) 10Bking: elastic relforge: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) [19:47:04] (03PS6) 10Ryan Kemper: elastic relforge: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [19:47:09] (03CR) 10Ryan Kemper: [C: 03+1] elastic relforge: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [19:47:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T328817)', diff saved to https://phabricator.wikimedia.org/P44096 and previous config saved to /var/cache/conftool/dbconfig/20230209-194730-marostegui.json [19:47:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:47:34] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [19:47:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:47:47] (03CR) 10Bking: [C: 03+2] elastic relforge: add ESJsonLayout log config [puppet] - 10https://gerrit.wikimedia.org/r/885439 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [19:55:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P44097 and previous config saved to /var/cache/conftool/dbconfig/20230209-195539-ladsgroup.json [19:57:00] (03PS1) 10Bking: elastic relforge: update logstash transport [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) [20:03:55] (03PS2) 10Bking: elastic relforge: update 
logstash transport [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) [20:05:08] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [20:05:51] (03PS3) 10Bking: elastic relforge: update logstash transport [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) [20:07:29] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [20:10:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P44098 and previous config saved to /var/cache/conftool/dbconfig/20230209-201045-ladsgroup.json [20:12:11] (03CR) 10Bking: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39490/console" [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [20:17:56] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-cephleaks.py: add 'delete' functionality [puppet] - 10https://gerrit.wikimedia.org/r/887789 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott) [20:19:50] 10SRE, 10LDAP-Access-Requests: Add Jon Amar WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T329324 (10WMDE-leszek) I endorse the request on WMDE's behalf, and confirm the identity of @jon_amar-WMDE. 
[20:20:40] (03PS1) 10Mforns: analytics::refinery::job::druid_load.pp: remove absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/888082 (https://phabricator.wikimedia.org/T328933) [20:22:06] (03PS2) 10Mforns: analytics::refinery::job::druid_load.pp: remove absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/888082 (https://phabricator.wikimedia.org/T328933) [20:22:18] (03PS3) 10Mforns: analytics::refinery::job::druid_load.pp: remove absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/888082 (https://phabricator.wikimedia.org/T328933) [20:23:15] (03CR) 10Ryan Kemper: [C: 03+1] "PCC looks as expected" [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [20:23:25] (03CR) 10Bking: [V: 03+1 C: 03+2] elastic relforge: update logstash transport [puppet] - 10https://gerrit.wikimedia.org/r/888078 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [20:25:42] (03PS4) 10Mforns: analytics::refinery::job::druid_load.pp: remove absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/888082 (https://phabricator.wikimedia.org/T328933) [20:25:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T328255)', diff saved to https://phabricator.wikimedia.org/P44099 and previous config saved to /var/cache/conftool/dbconfig/20230209-202551-ladsgroup.json [20:25:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [20:25:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [20:25:57] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [20:27:14] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335 [20:27:18] T324335: 
Remove logstash from the Search Elasticsearch servers - https://phabricator.wikimedia.org/T324335 [20:31:01] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335 [20:32:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [20:32:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [20:32:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44100 and previous config saved to /var/cache/conftool/dbconfig/20230209-203236-ladsgroup.json [20:32:40] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [20:34:08] 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-extensions-Phonos, 10Wikimedia-production-error: Steady rate of Phonos Swift errors (inc. 
DescribeFileOp failed, FileBackendStore::ingestFreshFileStats: Could not stat) - https://phabricator.wikimedia.org/T329249 (10Aklapper) [20:42:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44101 and previous config saved to /var/cache/conftool/dbconfig/20230209-204214-ladsgroup.json [20:42:18] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [20:47:01] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335 [20:47:04] T324335: Remove logstash from the Search Elasticsearch servers - https://phabricator.wikimedia.org/T324335 [20:50:52] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge logging config change - bking@cumin1001 - T324335 [20:57:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P44102 and previous config saved to /var/cache/conftool/dbconfig/20230209-205720-ladsgroup.json [21:00:04] brennen and TheresNoTime: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230209T2100). [21:00:04] musikanimal: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:40] musikanimal: around for backport? 
I can deploy [21:02:07] (03CR) 10RLazarus: [C: 03+1] slo_dashboards: dynamic slo dashboard panels (032 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [21:02:10] o/ [21:02:17] cool :) [21:05:15] (03PS1) 10Andrew Bogott: wmcs-novastats-cephleaks.py: remove a broken (and unneeded) output check. [puppet] - 10https://gerrit.wikimedia.org/r/888087 (https://phabricator.wikimedia.org/T289623) [21:07:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888066 (https://phabricator.wikimedia.org/T328224) (owner: 10MusikAnimal) [21:08:55] (03Merged) 10jenkins-bot: InitialiseSettings: install PageAssessments on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888066 (https://phabricator.wikimedia.org/T328224) (owner: 10MusikAnimal) [21:09:19] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:888066|InitialiseSettings: install PageAssessments on newiki (T328224)]] [21:09:26] T328224: Deploy PageAssessments to Nepali Wikipedia - https://phabricator.wikimedia.org/T328224 [21:11:11] !log thcipriani@deploy1002 musikanimal and thcipriani: Backport for [[gerrit:888066|InitialiseSettings: install PageAssessments on newiki (T328224)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:11:28] ^ musikanimal your patch is on mwdebug, check please :) [21:11:34] checking! [21:12:11] db error. Did you run update.php? (sorry I don't know how this works for deployers) [21:12:21] I guess I should have said that beforehand, sorry [21:12:23] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-cephleaks.py: remove a broken (and unneeded) output check. 
[puppet] - 10https://gerrit.wikimedia.org/r/888087 (https://phabricator.wikimedia.org/T289623) (owner: 10Andrew Bogott) [21:12:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P44103 and previous config saved to /var/cache/conftool/dbconfig/20230209-211226-ladsgroup.json [21:12:53] `mwscript extensions/WikimediaMaintenance/createExtensionTables.php newiki pageassessments` [21:12:57] on mwmaint1002 [21:13:02] oof, musikanimal no, sorry, we don't run update.php as part of deploy. Usually folks sync up with the dba before hand to do that. [21:13:13] bah [21:13:17] :( [21:13:18] okay, this can wait if it needs to [21:13:27] I would prefer that [21:13:32] okay no problem :) [21:13:39] thanks and sorry, reverting [21:13:44] !log thcipriani@deploy1002 sync-world aborted: Backport for [[gerrit:888066|InitialiseSettings: install PageAssessments on newiki (T328224)]] (duration: 04m 24s) [21:13:44] !log thcipriani@deploy1002 backport aborted: (duration: 06m 05s) [21:13:53] my fault! 
I should read the docs or something [21:14:27] (03PS1) 10TrainBranchBot: Revert "InitialiseSettings: install PageAssessments on newiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888089 [21:14:29] (03CR) 10TrainBranchBot: "thcipriani@deploy1002 created a revert of this change as I0972d873ffaa4106a0bec64e758729e243bf8896" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888066 (https://phabricator.wikimedia.org/T328224) (owner: 10MusikAnimal) [21:15:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888089 (owner: 10TrainBranchBot) [21:16:39] (03Merged) 10jenkins-bot: Revert "InitialiseSettings: install PageAssessments on newiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888089 (owner: 10TrainBranchBot) [21:17:02] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (10jijiki) [21:17:04] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:888089|Revert "InitialiseSettings: install PageAssessments on newiki"]] [21:18:56] !log thcipriani@deploy1002 trainbranchbot and thcipriani: Backport for [[gerrit:888089|Revert "InitialiseSettings: install PageAssessments on newiki"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:19:32] alright, should be all reset on the mwdebug servers [21:19:37] !log thcipriani@deploy1002 Sync cancelled. 
[21:27:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T328255)', diff saved to https://phabricator.wikimedia.org/P44104 and previous config saved to /var/cache/conftool/dbconfig/20230209-212732-ladsgroup.json [21:27:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [21:27:37] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [21:27:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [21:27:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [21:27:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [21:27:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44105 and previous config saved to /var/cache/conftool/dbconfig/20230209-212747-ladsgroup.json [21:29:32] 10SRE, 10Traffic: create a puppetized abstraction for haproxy blocklist hysteresis - https://phabricator.wikimedia.org/T329331 (10CDanis) [21:36:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44106 and previous config saved to /var/cache/conftool/dbconfig/20230209-213607-ladsgroup.json [21:36:11] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [21:47:24] 10SRE, 10Traffic, 10Data Pipelines (Sprint 08): Document Impact of Jan 8&9 Traffic Data Loss - https://phabricator.wikimedia.org/T326658 (10Snwachukwu) Here is a google [[ https://docs.google.com/document/d/1rz7L24EVECOKYGhn-GTUIGo3NmbCIGCQW9XXLNrxCEM/edit# | doc ]] containing a draft [21:51:14] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P44107 and previous config saved to /var/cache/conftool/dbconfig/20230209-215114-ladsgroup.json [22:00:58] (03PS1) 10Zabe: Start reading from rev_comment_id in cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888093 (https://phabricator.wikimedia.org/T275246) [22:06:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P44108 and previous config saved to /var/cache/conftool/dbconfig/20230209-220620-ladsgroup.json [22:15:12] (03Abandoned) 10Cwhite: logstash: migrate mediawiki_ecs to ecs 1.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/831952 (https://phabricator.wikimedia.org/T314098) (owner: 10Cwhite) [22:16:06] (03CR) 10Cwhite: [C: 03+2] logstash: enable error.stack.previous_trace [puppet] - 10https://gerrit.wikimedia.org/r/886863 (https://phabricator.wikimedia.org/T314098) (owner: 10Cwhite) [22:21:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T328255)', diff saved to https://phabricator.wikimedia.org/P44109 and previous config saved to /var/cache/conftool/dbconfig/20230209-222126-ladsgroup.json [22:21:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [22:21:31] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [22:21:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [22:21:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44110 and previous config saved to /var/cache/conftool/dbconfig/20230209-222137-ladsgroup.json [22:25:53] PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 1893 MB (3% inode=97%): 
/srv/swift-storage/sda3 10727 MB (5% inode=99%): /tmp 1893 MB (3% inode=97%): /var/tmp 1893 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops [22:26:13] (03PS1) 10Volans: debmonitorgc: garbage collect also stale Hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/888095 [22:30:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44111 and previous config saved to /var/cache/conftool/dbconfig/20230209-223003-ladsgroup.json [22:30:07] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [22:36:13] 10SRE, 10Data-Persistence, 10Discovery-Search, 10serviceops, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10bd808) >>! In T329193#8601521, @Clement_Goubert wrote: >>>! In T327920#8570661, @bd808 wrote: >> #Toolhub does not have a working Kube... 
[22:40:32] jouncebot, nowandnext
[22:40:32] No deployments scheduled for the next 8 hour(s) and 19 minute(s)
[22:40:32] In 8 hour(s) and 19 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230210T0700)
[22:40:50] (CR) Zabe: [C: +2] Start reading from rev_comment_id in cebwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/888093 (https://phabricator.wikimedia.org/T275246) (owner: Zabe)
[22:41:56] (Merged) jenkins-bot: Start reading from rev_comment_id in cebwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/888093 (https://phabricator.wikimedia.org/T275246) (owner: Zabe)
[22:42:28] !log zabe@deploy1002 Started scap: Backport for [[gerrit:888093|Start reading from rev_comment_id in cebwiki (T275246)]]
[22:42:31] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[22:44:20] !log zabe@deploy1002 zabe: Backport for [[gerrit:888093|Start reading from rev_comment_id in cebwiki (T275246)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[22:45:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P44112 and previous config saved to /var/cache/conftool/dbconfig/20230209-224509-ladsgroup.json
[22:50:55] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:888093|Start reading from rev_comment_id in cebwiki (T275246)]] (duration: 08m 26s)
[22:50:59] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[23:00:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P44113 and previous config saved to /var/cache/conftool/dbconfig/20230209-230016-ladsgroup.json
[23:03:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:08:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:10:37] RECOVERY - IPMI Sensor Status on mw2332 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[23:10:37] RECOVERY - IPMI Sensor Status on mw2329 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[23:14:29] SRE, ops-codfw, DC-Ops, serviceops-radar: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 (Jhancock.wm) investigated each server individually. mw2329 had a bad cord. replaced. The input pow...
[23:14:35] SRE, Traffic: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (tstarling) Regarding the concern that malicious user input could lead to injection of ESI tags: * In the old parser: * HTML comments in user input are completely removed * Angle brackets...
[23:15:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44114 and previous config saved to /var/cache/conftool/dbconfig/20230209-231522-ladsgroup.json
[23:15:26] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[23:26:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:31:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:32:55] SRE, ops-codfw, DC-Ops, serviceops-radar: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] - https://phabricator.wikimedia.org/T328343 (Papaul) @Jhancock.wm thank you. You can resolve the task
[23:34:37] SRE, SRE-OnFire, ops-codfw, Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (Papaul) Your shipment 1ZA19A020397868137 Delivered On Thursday, February 09 at 3:41 P.M. at Dock Delivered To LAREDO, TX US Received By: ESQUIVEL Proof of Delivery
[23:36:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:37:10] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:38:12] (PS1) Ladsgroup: Revert "Start reading from rev_comment_id in cebwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/887858
[23:38:14] (PS1) Zabe: Revert "Start reading from rev_comment_id in cebwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/887859 (https://phabricator.wikimedia.org/T275246)
[23:38:18] (CR) Ladsgroup: [C: +2] Revert "Start reading from rev_comment_id in cebwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/887858 (owner: Ladsgroup)
[23:38:32] (Abandoned) Zabe: Revert "Start reading from rev_comment_id in cebwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/887859 (https://phabricator.wikimedia.org/T275246) (owner: Zabe)
[23:39:14] (Merged) jenkins-bot: Revert "Start reading from rev_comment_id in cebwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/887858 (owner: Ladsgroup)
[23:39:59] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:887858|Revert "Start reading from rev_comment_id in cebwiki"]]
[23:41:51] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:887858|Revert "Start reading from rev_comment_id in cebwiki"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[23:41:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:42:10] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:46:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[23:49:03] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:887858|Revert "Start reading from rev_comment_id in cebwiki"]] (duration: 09m 04s)