[00:07:20] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[00:11:58] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[00:16:16] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T339178 (10phaultfinder)
[00:21:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (10Dzahn) a:03Dzahn
[00:22:17] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (10Dzahn) 05Open→03In progress
[00:24:00] <wikibugs>	 (03PS2) 10Eevans: cassandra: do not hardcode monitoring ports [puppet] - 10https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639)
[00:26:53] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639) (owner: 10Eevans)
[00:39:02] <wikibugs>	 (03PS3) 10Eevans: cassandra: do not hardcode monitoring ports [puppet] - 10https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639)
[00:39:27] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/929761
[00:39:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/929761 (owner: 10TrainBranchBot)
[00:40:28] <wikibugs>	 (03PS4) 10Eevans: cassandra: do not hardcode monitoring ports [puppet] - 10https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639)
[00:46:41] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639) (owner: 10Eevans)
[00:54:21] <wikibugs>	 (03PS1) 10Jsn.sherman: beta: log click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930280
[01:00:18] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/929761 (owner: 10TrainBranchBot)
[01:15:18] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:12] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:43:02] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10Traffic: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (10Legoktm)
[01:43:34] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:50:42] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:34] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:28] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:34] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:34] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:08] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:35:40] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:01:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:07:50] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:10:56] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:11:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:14:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318 (10JameelKaisar) For now we are considering only the 'request_time_ms'. We are taking request time for all the probes/pulses and g...
[04:20:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318 (10JameelKaisar) **Probenet Results:**    - Belarus (BY) {F37104295}    - Czechia (CZ) {F37104297}    - Kazakstan (KZ) {F37104299}...
[04:21:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:01:06] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:02:38] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:07] <wikibugs>	 (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) (owner: 10Marco Fossati)
[05:12:07] <wikibugs>	 (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929020 (https://phabricator.wikimedia.org/T318436) (owner: 10Lupok)
[05:19:51] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] kserve-inference: refactor the predictor's container settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/930209 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey)
[05:20:22] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] ml-services: set readinessProbe settings for falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/930200 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey)
[05:31:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:31:26] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1124,db1125,db1133: Binlog set to SBR [puppet] - 10https://gerrit.wikimedia.org/r/929948 (https://phabricator.wikimedia.org/T322993) (owner: 10Marostegui)
[05:33:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1168 to upgrade to 10.6.14 T338918', diff saved to https://phabricator.wikimedia.org/P49430 and previous config saved to /var/cache/conftool/dbconfig/20230615-053318-root.json
[05:33:23] <stashbot>	 T338918: Compile and package 10.6.14 - https://phabricator.wikimedia.org/T338918
[05:33:47] <wikibugs>	 (03PS1) 10Marostegui: db1168: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/930292
[05:34:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1168: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/930292 (owner: 10Marostegui)
[05:37:54] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1168: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/930307
[05:38:33] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1168: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/930307 (owner: 10Marostegui)
[05:47:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49431 and previous config saved to /var/cache/conftool/dbconfig/20230615-054716-root.json
[05:52:05] <wikibugs>	 (03PS1) 10Jameel Kaisar: Update mappings for some countries based on initial Probenet data [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T0600).
[06:02:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49432 and previous config saved to /var/cache/conftool/dbconfig/20230615-060220-root.json
[06:04:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Record new MOU date for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/930296
[06:06:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Record new MOU date for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/930296 (owner: 10Muehlenhoff)
[06:17:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49433 and previous config saved to /var/cache/conftool/dbconfig/20230615-061725-root.json
[06:23:35] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder)
[06:30:00] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "I checked the mappings, +1 there." [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar)
[06:31:34] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID 2066
[06:31:40] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID 2066
[06:32:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49434 and previous config saved to /var/cache/conftool/dbconfig/20230615-063230-root.json
[06:32:34] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:39:41] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] Correctly locate firewall type for IDM. [puppet] - 10https://gerrit.wikimedia.org/r/930177 (owner: 10Slyngshede)
[06:39:52] <icinga-wm>	 PROBLEM - puppet last run on analytics1069 is CRITICAL: CRITICAL: Puppet has been disabled for 604926 seconds, message: Journal node is about to be decommissioned thus, swap the journal node with another -T338336 - {USER} - stevemunene, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:47:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49435 and previous config saved to /var/cache/conftool/dbconfig/20230615-064734-root.json
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T0700).
[07:00:16] <apergos>	 morning!
[07:00:26] <apergos>	 today there are no patches scheduled for deployment in the calendar
[07:00:35] <apergos>	 likewise, no trainees have signed up for this slot
[07:01:15] <RhinosF1>	 Nice and peaceful Thursday morning then
[07:01:24] <apergos>	 yep, see everyone next time!
[07:01:59] <wikibugs>	 (03PS1) 10Slyngshede: Keymanagement: Fix squashed migration [software/bitu] - 10https://gerrit.wikimedia.org/r/930512
[07:02:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49436 and previous config saved to /var/cache/conftool/dbconfig/20230615-070239-root.json
[07:03:09] <wikibugs>	 (03PS2) 10Jameel Kaisar: Update mappings for some countries based on initial Probenet data [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318)
[07:03:22] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Keymanagement: Fix squashed migration [software/bitu] - 10https://gerrit.wikimedia.org/r/930512 (owner: 10Slyngshede)
[07:06:23] <wikibugs>	 (03CR) 10Jameel Kaisar: [C: 03+1] Update mappings for some countries based on initial Probenet data (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar)
[07:11:11] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host mw1492.eqiad.wmnet with OS buster
[07:11:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host mw1492.eqiad.wmnet with OS buster
[07:12:46] <wikibugs>	 (03CR) 10Ayounsi: "Could it be a validator instead, to catch the issue if the interface is created/modified manually too? (or with different automation). And" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930264 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[07:17:10] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] kserve-inference: refactor the predictor's container settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/930209 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey)
[07:17:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: set readinessProbe settings for falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/930200 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey)
[07:17:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49437 and previous config saved to /var/cache/conftool/dbconfig/20230615-071744-root.json
[07:24:58] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1492.eqiad.wmnet with reason: host reimage
[07:27:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1492.eqiad.wmnet with reason: host reimage
[07:27:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review:  test_matching_vlan() function crashing in Netbox network report - https://phabricator.wikimedia.org/T339133 (10ayounsi)
[07:28:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: test_matching_vlan() function crashing in Netbox network report - https://phabricator.wikimedia.org/T339133 (10ayounsi)
[07:29:58] <wikibugs>	 (03PS1) 10Elukey: ml-services: add "container" dict in experimental bloom-560m [deployment-charts] - 10https://gerrit.wikimedia.org/r/930514
[07:31:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add "container" dict in experimental bloom-560m [deployment-charts] - 10https://gerrit.wikimedia.org/r/930514 (owner: 10Elukey)
[07:32:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49438 and previous config saved to /var/cache/conftool/dbconfig/20230615-073248-root.json
[07:34:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[07:45:12] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:45:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10elukey) @Papaul thanks! I confirm that it works :)  I think that there is only one thing to do, namely update the documentation (https://wikitech.wikimedia.org/wiki/Management_Interfaces#Di...
[07:46:20] <wikibugs>	 (03CR) 10Ayounsi: "Overall lgtm, was it tested? Maybe we can compare the execution time to see the improvement?" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704) (owner: 10Cathal Mooney)
[07:46:46] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:49:58] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/929963 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey)
[07:55:56] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1001"
[08:00:04] <jouncebot>	 jnuche and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T0800).
[08:00:39] <jnuche>	 morning, I'll roll forward the train in 5m
[08:04:58] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930516 (https://phabricator.wikimedia.org/T337527)
[08:05:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930516 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot)
[08:05:46] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930516 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot)
[08:06:24] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb: add transparencyreport release to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/930188 (https://phabricator.wikimedia.org/T338781) (owner: 10Jelto)
[08:07:14] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: add transparencyreport release to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/930188 (https://phabricator.wikimedia.org/T338781) (owner: 10Jelto)
[08:10:10] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[08:11:19] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[08:11:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:11:57] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] role::cache::{text,upload}: move ulsfo varnishkafkas to PKI [puppet] - 10https://gerrit.wikimedia.org/r/929963 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey)
[08:13:04] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[08:13:31] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.13  refs T337527
[08:13:34] <stashbot>	 T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527
[08:15:02] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[08:16:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:19:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[08:20:20] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5025.eqsin.wmnet
[08:20:23] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5017.eqsin.wmnet
[08:20:58] <fabfur>	 !log reboot cp5017 and cp5025 for kernel upgrade (T335835)
[08:21:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:09] <wikibugs>	 (03PS1) 10Jaime Nuche: jenkins: add doc rsync password to releases instance [puppet] - 10https://gerrit.wikimedia.org/r/930517 (https://phabricator.wikimedia.org/T336168)
[08:27:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:31:17] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5017.eqsin.wmnet
[08:31:24] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5025.eqsin.wmnet
[08:31:58] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:32:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:34:41] <wikibugs>	 (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/930517/41738/" [puppet] - 10https://gerrit.wikimedia.org/r/930517 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[08:37:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Key confirmed via out-of-band validation" [puppet] - 10https://gerrit.wikimedia.org/r/929994 (https://phabricator.wikimedia.org/T336769) (owner: 10Vgutierrez)
[08:37:26] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:37:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Key confirmed via out-of-band validation" [homer/public] - 10https://gerrit.wikimedia.org/r/929998 (https://phabricator.wikimedia.org/T336769) (owner: 10Vgutierrez)
[08:37:53] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] admin: Update vgutierrez@yubikey5 key [puppet] - 10https://gerrit.wikimedia.org/r/929994 (https://phabricator.wikimedia.org/T336769) (owner: 10Vgutierrez)
[08:42:04] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:47:40] <wikibugs>	 (03CR) 10Jelto: "looks mostly good, thanks for moving the bash script out of a erb template. Two comments in line." [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney)
[08:52:47] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1001"
[08:52:52] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1492.eqiad.wmnet with OS buster
[08:52:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host mw1492.eqiad.wmnet with OS buster completed: - mw1492 (**WARN**)   - Downtimed on Icinga/Alertm...
[08:54:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet
[08:57:40] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] role::cache::{text,upload}: move ulsfo varnishkafkas to PKI [puppet] - 10https://gerrit.wikimedia.org/r/929963 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey)
[08:58:50] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1492 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[08:59:29] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1492.eqiad.wmnet
[08:59:29] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1492.eqiad.wmnet
[09:00:07] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1492.eqiad.wmnet
[09:00:07] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1492.eqiad.wmnet
[09:00:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet
[09:01:00] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5026.eqsin.wmnet
[09:01:01] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5018.eqsin.wmnet
[09:01:07] <fabfur>	 !log reboot cp5018 and cp5026 for kernel upgrade (T335835)
[09:01:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:07] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[09:04:14] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2004-dev - aborrero@cumin2002"
[09:05:06] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:05:20] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2004-dev - aborrero@cumin2002"
[09:05:20] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:05:40] <elukey>	 !log move varnishkafka instances in ulsfo to PKI - T337825
[09:05:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:43] <stashbot>	 T337825: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825
[09:06:46] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudservices2004-dev.codfw.wmnet on all recursors
[09:06:48] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudservices2004-dev.codfw.wmnet on all recursors
[09:07:56] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudservices2004-dev.mgmt.codfw.wmnet on all recursors
[09:07:59] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudservices2004-dev.mgmt.codfw.wmnet on all recursors
[09:08:16] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2004-dev.codfw.wmnet with OS bullseye
[09:08:32] <wikibugs>	 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.w...
[09:12:04] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5026.eqsin.wmnet
[09:12:10] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudservices2004-dev: put into service with new setup [puppet] - 10https://gerrit.wikimedia.org/r/930212 (https://phabricator.wikimedia.org/T338778)
[09:13:48] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5018.eqsin.wmnet
[09:14:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:19:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:24:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices2004-dev: put into service with new setup [puppet] - 10https://gerrit.wikimedia.org/r/930212 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez)
[09:25:38] <Amir1>	 jouncebot: nowandnext
[09:25:38] <jouncebot>	 For the next 0 hour(s) and 34 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T0800)
[09:25:38] <jouncebot>	 In 0 hour(s) and 34 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1000)
[09:25:38] <jouncebot>	 In 0 hour(s) and 34 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1000)
[09:26:37] <wikibugs>	 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero)
[09:29:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/929962 (https://phabricator.wikimedia.org/T264181) (owner: 10Gehel)
[09:30:20] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudservices2004-dev: fix typo in role assignment [puppet] - 10https://gerrit.wikimedia.org/r/930523 (https://phabricator.wikimedia.org/T338778)
[09:31:15] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices2004-dev: fix typo in role assignment [puppet] - 10https://gerrit.wikimedia.org/r/930523 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez)
[09:34:06] <wikibugs>	 10SRE, 10serviceops, 10API Platform (RESTbase Deprecation Roadmap), 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jgiannelos)
[09:34:18] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5027.eqsin.wmnet
[09:34:18] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5019.eqsin.wmnet
[09:34:22] <fabfur>	 !log reboot cp5019 and cp5027 for kernel upgrade (T335835)
[09:34:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:53] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2004-dev: fix role [puppet] - 10https://gerrit.wikimedia.org/r/930524 (https://phabricator.wikimedia.org/T338778)
[09:35:10] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudservices2004-dev: fix role [puppet] - 10https://gerrit.wikimedia.org/r/930524 (https://phabricator.wikimedia.org/T338778)
[09:37:11] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices2004-dev: fix role [puppet] - 10https://gerrit.wikimedia.org/r/930524 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez)
[09:39:59] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[09:41:28] <wikibugs>	 (03CR) 10Cathal Mooney: Validate port block speed combo in server provision script for QFX5120 (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930264 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[09:41:55] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2004-dev.private.codfw.wikimedia.cloud - aborrero@cumin2002"
[09:42:58] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2004-dev.private.codfw.wikimedia.cloud - aborrero@cumin2002"
[09:42:58] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:43:15] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudservices2004-dev.private.codfw.wikimedia.cloud on all recursors
[09:43:18] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudservices2004-dev.private.codfw.wikimedia.cloud on all recursors
[09:47:18] <wikibugs>	 (03PS1) 10Clément Goubert: trafficserver: Send testwiki traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/930547 (https://phabricator.wikimedia.org/T337489)
[09:47:24] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5027.eqsin.wmnet
[09:48:18] <wikibugs>	 (03PS2) 10Clément Goubert: trafficserver: Send testwiki traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/930547 (https://phabricator.wikimedia.org/T337489)
[09:51:14] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] handler.images: remove async from poolcounter release (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan)
[09:51:47] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp5019.eqsin.wmnet
[09:51:52] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] jenkins: add doc rsync password to releases instance [puppet] - 10https://gerrit.wikimedia.org/r/930517 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[09:53:05] <moritzm>	 !log installing openssl security updates on buster
[09:53:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:12] <icinga-wm>	 PROBLEM - Check systemd state on cp5019 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy_stek_job.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:08] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:57:23] <moritzm>	 !log restarting FPM on mw canaries
[09:57:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:37] <wikibugs>	 (03PS1) 10Elukey: ml-services: tweak readinessProbe settings for falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/930550
[09:59:56] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: tweak readinessProbe settings for falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/930550 (owner: 10Elukey)
[10:00:03] <wikibugs>	 (03Merged) 10jenkins-bot: handler.images: remove async from poolcounter release [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan)
[10:00:06] <jouncebot>	 mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1000).
[10:00:06] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1000)
[10:02:47] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:03:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb1021.eqiad.wmnet with reason: T337961
[10:03:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb1021.eqiad.wmnet with reason: T337961
[10:03:39] <stashbot>	 T337961: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961
[10:03:55] <wikibugs>	 (03CR) 10Cathal Mooney: Modify network report to get prefixes for all vlans before checks (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704) (owner: 10Cathal Mooney)
[10:04:31] <Amir1>	 !log root@clouddb1021.eqiad.wmnet[metawiki]> ALTER TABLE pagelinks ROW_FORMAT=COMPRESSED; (T337961)
[10:04:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:36] <btullis>	 !log removed hadoop packages incorrectly labelled for i386 in thirdparty/bigtop15 bullseye-wikimedia
[10:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wdqs-all
[10:08:27] <wikibugs>	 (03PS1) 10Muehlenhoff: Provided a dedicated KDC logrotate config and fix service reload [puppet] - 10https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906)
[10:08:45] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff)
[10:08:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[10:09:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:14:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry
[10:14:50] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: fix poolcounter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/930552
[10:15:19] <logmsgbot>	 !log klausman@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply
[10:16:02] <logmsgbot>	 !log klausman@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[10:16:43] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: fix poolcounter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/930552 (owner: 10Hnowlan)
[10:17:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry
[10:17:43] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: fix poolcounter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/930552 (owner: 10Hnowlan)
[10:18:20] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[10:18:32] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[10:19:59] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357)
[10:20:31] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[10:20:37] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[10:22:36] <icinga-wm>	 RECOVERY - Check systemd state on cp5019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:22:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-codfw
[10:23:18] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/930556/41739/" [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez)
[10:23:38] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] jenkins: add doc rsync password to releases instance [puppet] - 10https://gerrit.wikimedia.org/r/930517 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[10:23:46] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/929764 (https://phabricator.wikimedia.org/T339223)
[10:30:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:30:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wdqs-all
[10:30:45] <wikibugs>	 (03PS1) 10Kosta Harlan: Section images: Fix scrolling to placeholder [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930531 (https://phabricator.wikimedia.org/T335209)
[10:32:34] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:32:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:32:55] <wikibugs>	 (03PS1) 10Hnowlan: Revert "handler.images: remove async from poolcounter release" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930533
[10:33:48] <wikibugs>	 (03CR) 10Hnowlan: "sigh" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930533 (owner: 10Hnowlan)
[10:33:51] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357)
[10:34:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-codfw
[10:34:36] <logmsgbot>	 !log klausman@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[10:34:55] <logmsgbot>	 !log klausman@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[10:35:07] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert)
[10:36:09] <wikibugs>	 (03Merged) 10jenkins-bot: modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert)
[10:37:00] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:37:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:37:54] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/930556/41740/" [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez)
[10:38:20] <wikibugs>	 (03PS6) 10Clément Goubert: mediawiki: Gracefully handle termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T331609)
[10:40:03] <wikibugs>	 (03CR) 10Mvolz: [C: 03+1] "The patterns look fine but haven't had a chance to test." [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[10:41:40] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:44:09] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/929765
[10:51:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:52:47] <wikibugs>	 (03PS6) 10Clément Goubert: utils: Simple dblist_to_urllist.py script [puppet] - 10https://gerrit.wikimedia.org/r/923591
[10:54:08] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:54:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:55:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:56:20] <logmsgbot>	 !log fabfur@cumin1001 conftool action : set/pooled=yes; selector: name=cp5019.eqsin.wmnet
[10:57:28] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5028.eqsin.wmnet
[10:57:29] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5020.eqsin.wmnet
[10:57:37] <fabfur>	 !log reboot cp5020 and cp5028 for kernel upgrade (T335835)
[10:57:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:53] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:58:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:00:19] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357)
[11:00:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:00:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez)
[11:00:47] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/929765 (owner: 10PipelineBot)
[11:01:37] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/929765 (owner: 10PipelineBot)
[11:01:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Section images: Fix scrolling to placeholder [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930531 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan)
[11:02:41] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357)
[11:04:08] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:04:19] <wikibugs>	 (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930531 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan)
[11:06:31] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357)
[11:07:48] <icinga-wm>	 PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:07:53] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:08:16] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5020.eqsin.wmnet
[11:08:33] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5028.eqsin.wmnet
[11:09:17] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/930556/41742/" [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez)
[11:11:59] <logmsgbot>	 !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices2004-dev.codfw.wmnet with OS bullseye
[11:12:09] <wikibugs>	 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.wmnet...
[11:13:03] <Amir1>	 jouncebot: nowandnext
[11:13:03] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 46 minute(s)
[11:13:03] <jouncebot>	 In 1 hour(s) and 46 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300)
[11:13:04] <jouncebot>	 In 1 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300)
[11:13:07] <Amir1>	 cooool
[11:13:18] <wikibugs>	 (03PS2) 10Ladsgroup: Remove nlwiki from windows-1252 encoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930192 (https://phabricator.wikimedia.org/T128154)
[11:13:21] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Remove nlwiki from windows-1252 encoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930192 (https://phabricator.wikimedia.org/T128154) (owner: 10Ladsgroup)
[11:13:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930192 (https://phabricator.wikimedia.org/T128154) (owner: 10Ladsgroup)
[11:14:08] <wikibugs>	 (03Merged) 10jenkins-bot: Remove nlwiki from windows-1252 encoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930192 (https://phabricator.wikimedia.org/T128154) (owner: 10Ladsgroup)
[11:14:39] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]]
[11:14:43] <stashbot>	 T128154: Migrate all old DB rows from windows-1252 to UTF-8 on nlwiki - https://phabricator.wikimedia.org/T128154
[11:16:14] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[11:17:05] <wikibugs>	 (03PS1) 10Ladsgroup: Switch five large wikis to extlinks read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930560 (https://phabricator.wikimedia.org/T335343)
[11:26:14] <wikibugs>	 (03PS6) 10EoghanGaffney: gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182
[11:26:23] <Amir1>	 11:24:36 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw2259.codfw.wmnet', 'mw1366.eqiad.wmnet', 'deploy1002.eqiad.wmnet', 'mw1404.eqiad.wmnet', 'mw2289.codfw.wmnet', 'deploy2002.codfw.wmnet', 'mw1398.eqiad.wmnet', 'mw1486.eqiad.wmnet', 'mw1420.eqiad.wmnet', 'mw2300.codfw.wmnet'] (ran as mwdeploy@parse1002.eqiad.wmnet) returned [255]: ssh: connect to host parse1002.eqiad.wmnet port 22: Connection timed out
[11:26:23] <wikibugs>	 (03CR) 10EoghanGaffney: gitlab: Add locking to backups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney)
[11:28:23] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41743/console" [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney)
[11:28:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-eqiad
[11:29:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public
[11:29:50] <wikibugs>	 (03PS4) 10Samtar: IS: Enable Phonos on 'small' projects, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930008 (https://phabricator.wikimedia.org/T336763)
[11:31:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public
[11:31:48] <Lucas_WMDE>	 I also get a timeout when trying to SSH to parse1002
[11:32:15] <wikibugs>	 (03PS1) 10Kosta Harlan: Section images: update rtl asset with flipped question mark [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930535 (https://phabricator.wikimedia.org/T335207)
[11:32:17] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]] (duration: 17m 38s)
[11:32:20] <stashbot>	 T128154: Migrate all old DB rows from windows-1252 to UTF-8 on nlwiki - https://phabricator.wikimedia.org/T128154
[11:33:07] <Amir1>	 claime effie: do you know what's happening? 
[11:33:14] <Amir1>	 (parse1002 is unreachable)
[11:33:22] <claime>	 Hmm no
[11:33:33] <Amir1>	 it's making scap sad
[11:33:49] <claime>	 I'll check
[11:33:52] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Fuzzy) p:05Medium→03High We hit the fan once again with the Israeli [[ https://he.wikisource.org/wiki/פקודת_מס_הכנסה | Income...
[11:34:51] <Amir1>	 thanks
[11:35:25] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Switch five large wikis to extlinks read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930560 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup)
[11:35:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930560 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup)
[11:36:41] <claime>	 Amir1: No ssh, no console via rac
[11:36:57] <claime>	 I'll pool=inactive it so you can proceed and hard reboot it
[11:37:02] <wikibugs>	 (03Merged) 10jenkins-bot: Switch five large wikis to extlinks read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930560 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup)
[11:37:12] <wikibugs>	 (03PS3) 10Hnowlan: trafficserver: route proton requests via the API gateway [puppet] - 10https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678)
[11:37:15] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:930560|Switch five large wikis to extlinks read new (T335343)]]
[11:37:19] <stashbot>	 T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343
[11:37:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema
[11:37:55] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=parse1002.eqiad.wmnet
[11:38:25] <claime>	 Amir1: depooled, tell me if it's enough to quiet scap
[11:38:48] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:930560|Switch five large wikis to extlinks read new (T335343)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[11:39:16] <claime>	 !log parse1002 not responding to ssh or console, depooled
[11:39:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema
[11:40:09] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on parse1002.eqiad.wmnet with reason: Powercycle
[11:40:10] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930586
[11:40:14] <Amir1>	 thanks
[11:40:22] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on parse1002.eqiad.wmnet with reason: Powercycle
[11:40:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-eqiad
[11:43:18] <icinga-wm>	 RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[11:43:52] <claime>	 It's back up, Amir1 tell me when you're done with your deployments and I'll scap pull/repool
[11:44:17] <Amir1>	 sure, thanks. it'll be done in a minute or two
[11:45:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-codfw
[11:46:26] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:930560|Switch five large wikis to extlinks read new (T335343)]] (duration: 09m 10s)
[11:46:29] <stashbot>	 T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343
[11:47:01] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] rest-gateway: add citoid support [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[11:48:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-codfw
[11:48:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad
[11:49:41] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] Revert "handler.images: remove async from poolcounter release" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930533 (owner: 10Hnowlan)
[11:49:53] <moritzm>	 !log restarting slapd on seagorgium/serpens
[11:49:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:29] <effie>	 Amir1: sorry I am at lunch, can I help ?
[11:50:29] <wikibugs>	 (03PS5) 10Samtar: IS: Enable Phonos on 'small' projects, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930008 (https://phabricator.wikimedia.org/T336763)
[11:50:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez)
[11:50:39] <claime>	 effie: all good, it's handled
[11:50:42] <Amir1>	 yup
[11:50:46] <Amir1>	 claime: I'm done
[11:50:46] <effie>	 cool thank you claime 
[11:50:50] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "merging since this is a NOOP for eqiad1" [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez)
[11:51:02] <claime>	 Amir1: Great, scap pulling on parse1002 and putting it back in the pool
[11:51:19] <Amir1>	 <3
[11:51:31] <claime>	 !log Repooled parse1002.eqiad.wmnet after powercycle
[11:51:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad
[11:52:44] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2004-dev.codfw.wmnet with OS bullseye
[11:52:56] <wikibugs>	 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.w...
[11:54:30] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] api-gateway: add device-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/930214 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan)
[11:58:39] <moritzm>	 !log restarting exim on lists1001
[11:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:48] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1002.eqiad.wmnet
[12:02:49] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1002.eqiad.wmnet
[12:03:24] <wikibugs>	 (03PS1) 10Stevemunene: analytics: Decommission analytics106[1-3] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930580 (https://phabricator.wikimedia.org/T317861)
[12:03:31] <wikibugs>	 (03PS1) 10Stevemunene: analytics: Remove analytics106[1-3] from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930581 (https://phabricator.wikimedia.org/T317861)
[12:03:37] <wikibugs>	 (03PS1) 10Stevemunene: analytics: Decommission analytics106[4-6] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930582 (https://phabricator.wikimedia.org/T317861)
[12:03:39] <wikibugs>	 (03PS1) 10Stevemunene: analytics: Remove analytics106[4-6] from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930583 (https://phabricator.wikimedia.org/T317861)
[12:03:43] <wikibugs>	 (03PS1) 10Stevemunene: analytics: Decommission analytics106[7-8] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930584 (https://phabricator.wikimedia.org/T317861)
[12:03:45] <wikibugs>	 (03PS1) 10Stevemunene: analytics: Remove analytics106[7-8] from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930585 (https://phabricator.wikimedia.org/T317861)
[12:03:47] <wikibugs>	 (03PS1) 10Stevemunene: analytics: Decommission analytics1069 from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930606 (https://phabricator.wikimedia.org/T317861)
[12:03:49] <wikibugs>	 (03PS1) 10Stevemunene: analytics: Remove analytics1069 from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930607 (https://phabricator.wikimedia.org/T317861)
[12:05:11] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10Jclark-ctr) Replaced optic.  Cleaned fiber on device side and on pp  (port serial 21615538) cable id 5249
[12:06:52] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5029.eqsin.wmnet
[12:06:54] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5021.eqsin.wmnet
[12:07:08] <fabfur>	 !log reboot cp5021 and cp5029 for kernel upgrade (T335835)
[12:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:09] <wikibugs>	 (03PS1) 10Slyngshede: Keymanagement: Handle MariaDB constraint limitation. [software/bitu] - 10https://gerrit.wikimedia.org/r/930608
[12:11:32] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage
[12:12:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ArielGlenn) >>! In T334955#8929123, @Papaul wrote: >... The 2 nodes are ready. Thank you   Thank you, we'll take 'em! :-)
[12:13:29] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41744/console" [puppet] - 10https://gerrit.wikimedia.org/r/929713 (https://phabricator.wikimedia.org/T333945) (owner: 10EoghanGaffney)
[12:14:01] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage
[12:17:39] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5021.eqsin.wmnet
[12:18:08] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5029.eqsin.wmnet
[12:18:10] <wikibugs>	 (03PS1) 10AikoChou: changeprop: remove match on specific wiki_id for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899)
[12:19:35] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki: Gracefully handle termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert)
[12:20:20] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] trafficserver: route proton requests via the API gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan)
[12:27:41] <moritzm>	 !log installing containerd security updates
[12:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:42] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update revert-risk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/930613
[12:31:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/930608 (owner: 10Slyngshede)
[12:32:39] <wikibugs>	 (03PS1) 10Samtar: IS: Enable Phonos on test2wiki, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930614 (https://phabricator.wikimedia.org/T336763)
[12:32:45] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Keymanagement: Handle MariaDB constraint limitation. [software/bitu] - 10https://gerrit.wikimedia.org/r/930608 (owner: 10Slyngshede)
[12:33:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] IS: Enable Phonos on test2wiki, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930614 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar)
[12:33:48] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: pdns: recursor: drop IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/930616 (https://phabricator.wikimedia.org/T338778)
[12:33:57] <wikibugs>	 (03PS2) 10Samtar: IS: Enable Phonos on test2wiki, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930614 (https://phabricator.wikimedia.org/T336763)
[12:34:45] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@d458338]: (no justification provided)
[12:34:54] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@d458338]: (no justification provided) (duration: 00m 09s)
[12:35:54] <moritzm>	 !log installing ffmpeg security updates
[12:35:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:57] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10Jclark-ctr) @ayounsi  removed 8 cables. deleted from netbox
[12:37:38] <wikibugs>	 (03PS1) 10Samtar: IS-Labs: Enable Phonos everywhere, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930617 (https://phabricator.wikimedia.org/T336763)
[12:38:24] <TheresNoTime>	 jouncebot: nowandnext
[12:38:24] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 21 minute(s)
[12:38:24] <jouncebot>	 In 0 hour(s) and 21 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300)
[12:38:24] <jouncebot>	 In 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300)
[12:40:37] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5030.eqsin.wmnet
[12:40:39] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5022.eqsin.wmnet
[12:40:47] <fabfur>	 !log reboot cp5022 and cp5030 for kernel upgrade (T335835)
[12:40:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (10ayounsi)
[12:41:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930617 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar)
[12:41:10] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) 05Open→03Resolved Awesome, thanks!
[12:42:11] <wikibugs>	 (03Merged) 10jenkins-bot: IS-Labs: Enable Phonos everywhere, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930617 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar)
[12:43:08] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: pdns: recursor: drop IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/930616 (https://phabricator.wikimedia.org/T338778)
[12:45:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons.
[12:46:04] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/930616/41747/" [puppet] - 10https://gerrit.wikimedia.org/r/930616 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez)
[12:46:24] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339168 (10Jclark-ctr) a:03Jclark-ctr
[12:47:41] <wikibugs>	 (03PS3) 10Samtar: IS: Enable Phonos on test2wiki, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930614 (https://phabricator.wikimedia.org/T336763)
[12:48:00] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[12:48:13] <wikibugs>	 (03PS2) 10Samtar: Switch VisualEditor to bypass RESTbase on all wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[12:48:16] <wikibugs>	 (03PS2) 10Samtar: beta: log click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930280 (owner: 10Jsn.sherman)
[12:51:57] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5030.eqsin.wmnet
[12:53:26] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5022.eqsin.wmnet
[12:53:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:56:10] <icinga-wm>	 RECOVERY - Host ps1-a4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms
[12:57:11] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[12:57:24] <wikibugs>	 (03PS1) 10Btullis: Update the mediawiki_history_reduced snapshot to AQS [puppet] - 10https://gerrit.wikimedia.org/r/930620
[12:57:42] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[12:58:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:58:39] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[12:59:07] <wikibugs>	 (03PS4) 10Hokwelum: Modify the global blocks script to override output dir via a command line arg [puppet] - 10https://gerrit.wikimedia.org/r/928861
[12:59:09] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[13:00:06] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300)
[13:00:06] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300).
[13:00:06] <jouncebot>	 duesen, JSherman, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:06] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] rest-gateway: add citoid support [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[13:00:39] <JSherman>	 present and ready
[13:00:43] <duesen>	 o/
[13:00:59] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: add citoid support [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[13:01:40] <MatmaRex>	 hi, i added one more thing to the window
[13:02:39] <Lucas_WMDE>	 I can’t deploy yet, sorry (maybe at :30 or so)
[13:05:19] * TheresNoTime can deploy
[13:05:36] <duesen>	 Ah, I was just about to say I can also self-service :)
[13:05:51] <TheresNoTime>	 duesen: feel free, but I don't mind :)
[13:05:58] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons.
[13:06:30] <JSherman>	 I'll need help, but mine is a beta only config change, so hopefully it will be an easy one.
[13:06:48] <duesen>	 TheresNoTime: I'll do it.
[13:07:02] <TheresNoTime>	 duesen: go ahead, ping me when you're done?
[13:07:33] <duesen>	 will do
[13:07:34] <duesen>	 merging now
[13:07:38] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 03+2] Switch VisualEditor to bypass RESTbase on all wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:07:47] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[13:08:10] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons.
[13:08:25] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[13:08:29] <wikibugs>	 (03Merged) 10jenkins-bot: Switch VisualEditor to bypass RESTbase on all wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:08:54] <duesen>	 starting backport
[13:09:09] <kostajh>	 I’m here
[13:09:17] <logmsgbot>	 !log daniel@deploy1002 Started scap: Backport for [[gerrit:929364|Switch VisualEditor to bypass RESTbase on all wikis. (T320529)]]
[13:09:20] <stashbot>	 T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529
[13:09:48] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "prep for deploy" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930531 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan)
[13:10:13] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "prep for deploy" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930535 (https://phabricator.wikimedia.org/T335207) (owner: 10Kosta Harlan)
[13:10:19] <TheresNoTime>	 kostajh: set those merging ^
[13:10:41] <logmsgbot>	 !log daniel@deploy1002 daniel: Backport for [[gerrit:929364|Switch VisualEditor to bypass RESTbase on all wikis. (T320529)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:10:46] <kostajh>	 JSherman: for beta patches, you can just +2 those, no need to put in a deployment window. (AFAIK, someone correct me if I'm wrong please)
[13:10:51] <duesen>	 hm, I just realized I need to also change this for labs. How do I even deploy a config change for labs?
[13:11:24] <kostajh>	 duesen: it will apply automatically to beta
[13:11:36] <kostajh>	 overrides for Labs are in InitialiseSettings-labs.php
[13:11:40] <kostajh>	 TheresNoTime: thank you!
[13:13:26] <duesen>	 testing on debug looks good.
[13:13:36] <duesen>	 kostajh: ok thanks.
[13:13:37] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5023.eqsin.wmnet
[13:13:38] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5031.eqsin.wmnet
[13:13:43] <fabfur>	 !log reboot cp5023 and cp5031 for kernel upgrade (T335835)
[13:13:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:13] <JSherman>	 kostajh: that's good to know; if one of the deployers can confirm, I'm happy to just +2 it.
[13:14:47] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[13:15:14] <duesen>	 ok, syncing
[13:16:27] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "Love it, thanks!" [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar)
[13:16:40] <TheresNoTime>	 JSherman: that's correct for IS-labs/CS-labs, it'll sync to the beta cluster every ~10m once +2'd — iirc it might show as other changes present to the deployer in the next window
[13:17:55] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "deploy, prod no/op" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930280 (owner: 10Jsn.sherman)
[13:17:59] <JSherman>	 TheresNoTime: Ok; I'll go ahead and +2. Thanks!
[13:18:12] <TheresNoTime>	 JSherman: just did :D
[13:18:22] <JSherman>	 :-)
[13:18:47] <wikibugs>	 (03Merged) 10jenkins-bot: beta: log click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930280 (owner: 10Jsn.sherman)
[13:19:16] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:19:23] <JSherman>	 Now I know for next time. If I do self service a beta config change in the future, should I still wait for a window and hang out here to do it, or is it a whenever thing?
[13:19:46] <wikibugs>	 (03PS1) 10Daniel Kinzler: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930626
[13:19:48] <wikibugs>	 (03PS1) 10Daniel Kinzler: Switch VisualEditor to bypass RESTbase on labs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529)
[13:20:25] <TheresNoTime>	 JSherman: I'd still double-check there's nothing going on in here, and maybe just announce you're going to do it?
[13:21:06] <logmsgbot>	 !log daniel@deploy1002 Finished scap: Backport for [[gerrit:929364|Switch VisualEditor to bypass RESTbase on all wikis. (T320529)]] (duration: 11m 48s)
[13:21:10] <stashbot>	 T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529
[13:21:12] <JSherman>	 TheresNoTime: ack; sounds reasonable.
[13:21:21] <duesen>	 TheresNoTime: can I deploy this one as well, so labs is in sync? https://gerrit.wikimedia.org/r/930627 Doesn't have to be now, but it shouldn't be out of whack for too long.
[13:21:35] <duesen>	 ok, config deployed to all prod wikis.
[13:21:40] <TheresNoTime>	 duesen: go ahead :)
[13:21:41] <duesen>	 Monitoring metrics
[13:21:44] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "awesome work, thanks!" [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar)
[13:21:55] <kostajh>	 duesen: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/930626/1 looks wrong to me
[13:22:00] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) I will be working with @Clement_Goubert today at 10am CT to relocate those mw nodes.
[13:22:08] <duesen>	 TheresNoTime: Do you think it's ok to deploy it while monitoring metrics to see if we need to revert the first one?
[13:22:19] <wikibugs>	 (03PS1) 10Herron: thanos-rule: add pyrra filesystem operator output dir to search path [puppet] - 10https://gerrit.wikimedia.org/r/930628 (https://phabricator.wikimedia.org/T302995)
[13:22:34] <duesen>	 kostajh: ah, right. i messed up the rebase
[13:22:43] <TheresNoTime>	 duesen: wait one, bad rebase (?) yeah
[13:23:06] <wikibugs>	 (03PS2) 10Daniel Kinzler: Switch VisualEditor to bypass RESTbase on labs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529)
[13:23:19] <wikibugs>	 (03Abandoned) 10Daniel Kinzler: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930626 (owner: 10Daniel Kinzler)
[13:23:57] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] analytics: Decommission analytics106[1-3] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930580 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene)
[13:24:14] <effie>	 duesen: so far so good?
[13:24:16] <duesen>	 kostajh: fixed. looks good now?
[13:24:24] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5023.eqsin.wmnet
[13:24:28] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:24:28] <duesen>	 effie: stash access is going up. still looking.
[13:24:39] <effie>	 ok cool, I will take a look too 
[13:24:43] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5031.eqsin.wmnet
[13:24:46] <duesen>	 kostajh, TheresNoTime: do you think i can merge the patch for labs?
[13:25:04] <wikibugs>	 (03PS1) 10Hnowlan: add discovery records for rest-gateway and device-analytics. [dns] - 10https://gerrit.wikimedia.org/r/930631 (https://phabricator.wikimedia.org/T335505)
[13:25:17] <kostajh>	 duesen: can I start the GrowthExperiments backports?
[13:26:00] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:26:21] <wikibugs>	 (03CR) 10Kosta Harlan: Switch VisualEditor to bypass RESTbase on labs. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:26:48] <duesen>	 stash writes doubled, 60 -> 130/sec
[13:27:42] <duesen>	 kostajh: does it still need merging? code or config?
[13:28:13] <duesen>	 ...small bump in sql writes...
[13:28:31] <duesen>	 ...small bump in network utilization on db hosts
[13:28:40] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons.
[13:29:05] <kostajh>	 duesen: I need to deploy the patches for GrowthExperiments
[13:29:14] <kostajh>	 they are to wmf.13
[13:29:16] <TheresNoTime>	 my 2c: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/930627 is still a valid config, so it can be +2'd if needed. The two GrowthExperiments patches (kostajh) are almost merged
[13:29:25] <duesen>	 effie: all looking good. stash access seems to stabilize at > 150 per minute
[13:29:30] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] Revert "handler.images: remove async from poolcounter release" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930533 (owner: 10Hnowlan)
[13:30:21] <duesen>	 kostajh: sure. can I deploy another config patch while we are waiting for them to merge?
[13:31:17] <kostajh>	 yep
[13:31:22] <duesen>	 cool
[13:31:41] <kostajh>	 TheresNoTime: will you continue the deployment process for GrowthExperiments patches or do you want me to take over?
[13:31:41] <duesen>	 ah nice, VE backend transform latency went down by 50%
[13:31:53] <kostajh>	 (I'm joining a meeting so would prefer if you keep moving them forward, if that's alright with you.)
[13:32:01] <TheresNoTime>	 kostajh: I can carry on
[13:32:09] <kostajh>	 ty!
[13:32:11] <JSherman>	 Okay, I verified that my instruments can now push events to that stream. TheresNoTime: and kostajh: thanks!
[13:32:15] <TheresNoTime>	 duesen: are you wanting to deploy a beta config patch now?
[13:32:19] <TheresNoTime>	 JSherman: ack :)
[13:32:47] <wikibugs>	 (03CR) 10Daniel Kinzler: Switch VisualEditor to bypass RESTbase on labs. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:33:02] <duesen>	 TheresNoTime: yes. merging.
[13:33:05] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 03+2] Switch VisualEditor to bypass RESTbase on labs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:33:08] <TheresNoTime>	 ty
[13:33:44] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "handler.images: remove async from poolcounter release" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930533 (owner: 10Hnowlan)
[13:33:50] <wikibugs>	 (03Merged) 10jenkins-bot: Switch VisualEditor to bypass RESTbase on labs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:34:30] <wikibugs>	 (03PS1) 10Elukey: ml-services: add more experimental settings for LLMs [deployment-charts] - 10https://gerrit.wikimedia.org/r/930632 (https://phabricator.wikimedia.org/T334583)
[13:34:51] * Lucas_WMDE now around
[13:34:56] <Lucas_WMDE>	 anything left to deploy or all good?
[13:35:06] <Reedy>	 all of the things
[13:35:11] <Reedy>	 oh, actually
[13:35:15] <TheresNoTime>	 Lucas_WMDE: I'm just about to deploy the two GrowthExperiments patches
[13:35:19] <Lucas_WMDE>	 ok!
[13:35:33] <wikibugs>	 (03PS3) 10Reedy: Revert "Temporarily disable UCoC link from non tech wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886)
[13:35:42] <Reedy>	 Lucas_WMDE: ^ if you want to deploy that, I wouldn't complain :)
[13:35:53] <duesen>	 TheresNoTime: can i run scap on the beta config patch?
[13:36:12] <TheresNoTime>	 duesen: https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/448266/console
[13:36:17] <TheresNoTime>	 it's doing it
[13:36:41] <wikibugs>	 (03PS4) 10Hnowlan: svg: attempt to build valid locales from hyphenated languages [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139)
[13:37:10] <duesen>	 TheresNoTime: oh, now I get what kostajh meant by "automatic". Cool :)
[13:37:16] <wikibugs>	 (03CR) 10Hnowlan: svg: attempt to build valid locales from hyphenated languages (032 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) (owner: 10Hnowlan)
[13:38:34] <wikibugs>	 (03Merged) 10jenkins-bot: Section images: Fix scrolling to placeholder [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930531 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan)
[13:38:37] <wikibugs>	 (03Merged) 10jenkins-bot: Section images: update rtl asset with flipped question mark [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930535 (https://phabricator.wikimedia.org/T335207) (owner: 10Kosta Harlan)
[13:39:15] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:930531|Section images: Fix scrolling to placeholder (T335209)]], [[gerrit:930535|Section images: update rtl asset with flipped question mark (T335207)]]
[13:39:21] <stashbot>	 T335209: Section-level images: suggestions mode - https://phabricator.wikimedia.org/T335209
[13:39:21] <stashbot>	 T335207: Section-level images: onboarding dialog - https://phabricator.wikimedia.org/T335207
[13:39:37] <Lucas_WMDE>	 Reedy: sure, once everything else is done
[13:40:36] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "I was slightly thrown by the commit saying we aren't hardcoding the port any more. But it's rather that we're moving it into hiera, right?" [puppet] - 10https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639) (owner: 10Eevans)
[13:40:42] <TheresNoTime>	 Lucas_WMDE: do you want to take over after these two are done? (just the maintenance script to start, and Re/edy's patch)
[13:40:44] <logmsgbot>	 !log samtar@deploy1002 kharlan and samtar: Backport for [[gerrit:930531|Section images: Fix scrolling to placeholder (T335209)]], [[gerrit:930535|Section images: update rtl asset with flipped question mark (T335207)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:40:58] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (elukey) Next steps:  * Roll out the changes to eqsin, and monitor. * Roll out the changes to codfw, and monitor. * Roll out the changes to eqiad, and monitor. * Roll out the ch...
[13:40:58] <TheresNoTime>	 kostajh: both live on mwdebug, can you test?
[13:41:51] <kostajh>	 TheresNoTime: sure, one sec
[13:41:57] <wikibugs>	 (CR) EoghanGaffney: [V: +1 C: +2] doc: Clean up leftover bits from switch to quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/929713 (https://phabricator.wikimedia.org/T333945) (owner: EoghanGaffney)
[13:42:40] <wikibugs>	 (PS1) Elukey: role::cache::{text,upload}: move vk instances to PKI in eqsin [puppet] - https://gerrit.wikimedia.org/r/930633 (https://phabricator.wikimedia.org/T337825)
[13:43:12] <kostajh>	 TheresNoTime: lgtm
[13:43:12] <wikibugs>	 (CR) Ssingh: "recheck" [debs/trafficserver] - https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[13:43:17] <TheresNoTime>	 syncing
[13:43:58] <Winston_Sung[m]>	 Anyone around for backports?
[13:44:14] <TheresNoTime>	 Winston_Sung[m]: me currently, Lucas_WMDE in a moment probably
[13:44:54] <Winston_Sung[m]>	 Here is the requested change to be backported: https://gerrit.wikimedia.org/r/929647
[13:45:08] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/930587
[13:47:02] <TheresNoTime>	 Lucas_WMDE: y/n on being able to take over?
[13:47:08] <Lucas_WMDE>	 sure
[13:47:20] <Lucas_WMDE>	 Winston_Sung[m]: was the idea in https://phabricator.wikimedia.org/T337527#8926660 that the backport should be done before the train reached group2?
[13:47:30] <Lucas_WMDE>	 because group2 is on wmf.13 now, the train was early today
[13:48:04] <icinga-wm>	 PROBLEM - Check systemd state on mw1448 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:09] <TheresNoTime>	 ...
[13:48:35] <TheresNoTime>	 deploy is currently in `php-fpm-restart`
[13:48:56] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:930531|Section images: Fix scrolling to placeholder (T335209)]], [[gerrit:930535|Section images: update rtl asset with flipped question mark (T335207)]] (duration: 09m 40s)
[13:49:01] <stashbot>	 T335209: Section-level images: suggestions mode - https://phabricator.wikimedia.org/T335209
[13:49:01] <stashbot>	 T335207: Section-level images: onboarding dialog - https://phabricator.wikimedia.org/T335207
[13:49:03] <TheresNoTime>	 kostajh: live
[13:49:13] <kostajh>	 thanks very much!
[13:49:19] <wikibugs>	 (CR) Hnowlan: [C: +2] svg: attempt to build valid locales from hyphenated languages [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) (owner: Hnowlan)
[13:49:29] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5032.eqsin.wmnet
[13:49:30] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5024.eqsin.wmnet
[13:49:34] <TheresNoTime>	 Lucas_WMDE: all that's left is MatmaRex's script run, and the two last-minute additions
[13:49:43] <fabfur>	 !log reboot cp5024 and cp5032 for kernel upgrade (T335835)
[13:49:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:17] <Lucas_WMDE>	 Reedy: want to remove your -1 from https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/924567 ?
[13:51:13] <Reedy>	 done
[13:51:16] <Lucas_WMDE>	 “Each of them will probably take a few weeks to complete” ._.
[13:51:20] <Lucas_WMDE>	 I’ve never run a maint script that long
[13:51:26] <Lucas_WMDE>	 just, open a tmux session on mwmaint, and let it rip?
[13:51:43] <wikibugs>	 (CR) Jgiannelos: [C: +2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/930587 (owner: PipelineBot)
[13:51:50] <moritzm>	 !log installing ruby2.5 security updates
[13:51:50] <TheresNoTime>	 yup :D
[13:51:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:55] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[13:51:58] <Lucas_WMDE>	 ok
[13:52:00] <MatmaRex>	 yes
[13:52:06] <TheresNoTime>	 ~~screen > tmux but ok~~
[13:52:13] <MatmaRex>	 i think you can follow what urbanec.m did with the last script
[13:52:18] <Lucas_WMDE>	 !.kb TheresNoTime
[13:52:18] <MatmaRex>	 https://phabricator.wikimedia.org/T315510#8929374
[13:52:25] <TheresNoTime>	 >:D
[13:52:31] <wikibugs>	 (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/930587 (owner: PipelineBot)
[13:52:46] <wikibugs>	 (Merged) jenkins-bot: Revert "Temporarily disable UCoC link from non tech wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[13:52:54] <claime>	 TheresNoTime: The 90s called, they want their terminal multiplexer back <3
[13:53:04] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:924567|Revert "Temporarily disable UCoC link from non tech wikis" (T280886)]]
[13:53:07] <stashbot>	 T280886: Add Code of Conduct link to the Universal Code of Conduct to all non technical wikis - https://phabricator.wikimedia.org/T280886
[13:53:28] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[13:53:33] <TheresNoTime>	 claime: I just know how it works without having to look anything up D:
[13:53:35] <wikibugs>	 (Merged) jenkins-bot: svg: attempt to build valid locales from hyphenated languages [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) (owner: Hnowlan)
[13:53:35] <Lucas_WMDE>	 brb creating puppet change to uninstall screen /s
[13:53:52] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[13:54:02] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[13:54:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 reedy and lucaswerkmeister-wmde: Backport for [[gerrit:924567|Revert "Temporarily disable UCoC link from non tech wikis" (T280886)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[13:54:33] <claime>	 Lucas_WMDE: Better idea, deploy a global tmuxrc that remaps everything to screen bindings
[13:54:35] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[13:54:39] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[13:54:41] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[13:54:46] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[13:54:57] <Lucas_WMDE>	 I see a code of conduct link on https://en.wikipedia.org/wiki/Main_Page on mwdebug
[13:55:26] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[13:55:30] <Lucas_WMDE>	 and a Verhaltenskodex at https://de.wikipedia.org/wiki/Wikipedia:Hauptseite
[13:55:41] <Lucas_WMDE>	 should be good to go I think
[13:55:48] <icinga-wm>	 RECOVERY - Check systemd state on mw1448 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:56:01] <Winston_Sung[m]>	 Lucas_WMDE: The backport should be done after group 2 moves to wmf.13.
[13:56:02] <Lucas_WMDE>	 claime: I actually use C-a instead of C-b for tmux ^^
[13:56:07] <Lucas_WMDE>	 (but don’t know any other screen bindings)
[13:56:18] <Lucas_WMDE>	 Winston_Sung[m]: ok, then now would be the right time
[13:56:25] <Lucas_WMDE>	 (syncing the UCoC change now)
[13:56:48] <wikibugs>	 (CR) BBlack: [C: +1] "Amazing work, thank you!" [dns] - https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: Jameel Kaisar)
[14:00:15] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5024.eqsin.wmnet
[14:00:31] <moritzm>	 !log remove ruby2.5 2.5.5-3+deb10u5+wmf1 (superseded by corrected Debian build 2.5.5-3+deb10u6) T338294
[14:00:34] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5032.eqsin.wmnet
[14:00:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:35] <stashbot>	 T338294: ruby2.5 2.5.5-3+deb10u5 breaks Puppet - https://phabricator.wikimedia.org/T338294
[14:01:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:924567|Revert "Temporarily disable UCoC link from non tech wikis" (T280886)]] (duration: 08m 44s)
[14:01:52] <stashbot>	 T280886: Add Code of Conduct link to the Universal Code of Conduct to all non technical wikis - https://phabricator.wikimedia.org/T280886
[14:02:04] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): Revert "Implement Language Converter for yue (Cantonese)" [core] (wmf/1.41.0-wmf.13) - https://gerrit.wikimedia.org/r/929647 (https://phabricator.wikimedia.org/T59106) (owner: Winston Sung)
[14:02:14] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.13) - https://gerrit.wikimedia.org/r/929647 (https://phabricator.wikimedia.org/T59106) (owner: Winston Sung)
[14:07:16] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:07:16] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:07:34] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:08:14] <jkieserman>	 Hello! I'm trying to rotate an API key on a mwmaint server. I believe I need to edit PrivateSettings.php, but can't seem to find that file?
[14:08:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[14:11:56] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:11:56] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:12:53] <taavi>	 jkieserman: if you want to update a PS value, you would need to update the canonical copy on deployment.eqiad.wmnet and deploy it with scap, any changes anywhere else will get overwritten
[14:12:56] <gehel>	 ryankemper, inflatador: should wdqs2021 be silenced?
[14:13:39] <wikibugs>	 SRE-tools, Spicerack: Service without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (Clement_Goubert)
[14:16:31] <jkieserman>	 Thanks taavi! A few follow-up questions. (1) Do we update by sshing into deployment.eqiad.wmnet? (2) Where does the PS file live on that server? (3) How do we deploy? (Sorry, total newb :) )
[14:17:34] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:09] <wikibugs>	 SRE-swift-storage, serviceops-collab: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (eoghan) Thanks for the feedback! Is there a test cluster that wmcs can connect to that we might be able to use with a test instance of gitlab in order to give it a try before we do thi...
[14:19:30] <wikibugs>	 (Merged) jenkins-bot: Revert "Implement Language Converter for yue (Cantonese)" [core] (wmf/1.41.0-wmf.13) - https://gerrit.wikimedia.org/r/929647 (https://phabricator.wikimedia.org/T59106) (owner: Winston Sung)
[14:19:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:929647|Revert "Implement Language Converter for yue (Cantonese)" (T59106 T337527)]]
[14:19:51] <stashbot>	 T59106: Implement LanguageConverter for yue (Cantonese) - https://phabricator.wikimedia.org/T59106
[14:19:51] <stashbot>	 T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527
[14:20:22] <taavi>	 jkieserman: (1) not sure what you're asking here, sorry (2) /srv/mediawiki-staging/private/ (3) you should presumably find someone with deployment rights and experience, for example show up here during a backport window
[14:20:30] <taavi>	 ah, they left :/
[14:21:11] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and wsung: Backport for [[gerrit:929647|Revert "Implement Language Converter for yue (Cantonese)" (T59106 T337527)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[14:21:17] <Lucas_WMDE>	 Winston_Sung[m]: can you test it?
[14:21:34] <Winston_Sung[m]>	 Testing...
[14:22:58] <Winston_Sung[m]>	 Everything looks fine for me.
[14:23:06] <Lucas_WMDE>	 checking logstash just to be safe
[14:23:36] <Lucas_WMDE>	 nothing that looks particularly concerning
[14:23:37] <Lucas_WMDE>	 let’s sync
[14:23:43] <Winston_Sung[m]>	 No console errors, network all HTTP 200.
[14:23:56] <Winston_Sung[m]>	 * HTTP 200 OK.
[14:24:03] <Winston_Sung[m]>	 Yeah. Let's sync.
[14:25:11] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] openstack: pdns: recursor: drop IPv6 support [puppet] - https://gerrit.wikimedia.org/r/930616 (https://phabricator.wikimedia.org/T338778) (owner: Arturo Borrero Gonzalez)
[14:25:12] <MatmaRex>	 (i need to step away for a minute, i hope you can still launch my maintenance. thanks)
[14:25:24] <Lucas_WMDE>	 MatmaRex: yup, wiil do
[14:25:27] <Lucas_WMDE>	 *will
[14:25:38] <wikibugs>	 (PS1) Eigyan: Remove GDI survey from RU and JA wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926)
[14:26:19] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: attempting WDQS stack on bullseye
[14:26:32] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: attempting WDQS stack on bullseye
[14:27:26] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:27:43] <wikibugs>	 (PS1) Hnowlan: thumbor: attempt to render hypenated svg languages better [deployment-charts] - https://gerrit.wikimedia.org/r/930641 (https://phabricator.wikimedia.org/T337139)
[14:29:00] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:29:06] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: attempt to render hypenated svg languages better [deployment-charts] - https://gerrit.wikimedia.org/r/930641 (https://phabricator.wikimedia.org/T337139) (owner: Hnowlan)
[14:29:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:929647|Revert "Implement Language Converter for yue (Cantonese)" (T59106 T337527)]] (duration: 09m 53s)
[14:29:40] <Lucas_WMDE>	 re maintenance script: from https://phabricator.wikimedia.org/T315510#8716277 and https://phabricator.wikimedia.org/T326314, I’m guessing that I should not start with s1, s4 or s8
[14:29:44] <stashbot>	 T59106: Implement LanguageConverter for yue (Cantonese) - https://phabricator.wikimedia.org/T59106
[14:29:44] <stashbot>	 T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527
[14:29:45] <Lucas_WMDE>	 since those three are busy backfilling externallinks
[14:29:54] <wikibugs>	 (Merged) jenkins-bot: thumbor: attempt to render hypenated svg languages better [deployment-charts] - https://gerrit.wikimedia.org/r/930641 (https://phabricator.wikimedia.org/T337139) (owner: Hnowlan)
[14:29:56] <Lucas_WMDE>	 but maybe I can do s2 and s3 in parallel, for instance
[14:30:07] <Lucas_WMDE>	 any objections?
[14:32:46] <MatmaRex>	 yes, that sounds reasonable
[14:33:01] <MatmaRex>	 i also suggested s5 and s6, since the externallinks work is done there as well
[14:33:29] <Lucas_WMDE>	 and urbanec.m is already running s7, I see
[14:34:16] <Lucas_WMDE>	 !log Start `foreachwikiindblist 'group2 & s2' DiscussionTools:persistRevisionThreadItems --current --all; touch ~/T315510-s2-exited-$?` in tmux on mwmaint1002 (T315510)
[14:34:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:20] <stashbot>	 T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510
[14:35:27] <Lucas_WMDE>	 !log Start `foreachwikiindblist 'group2 & s3' DiscussionTools:persistRevisionThreadItems --current --all; touch ~/T315510-s3-exited-$?` in tmux on mwmaint1002 (T315510)
[14:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:36] <wikibugs>	 (CR) Klausman: changeprop: remove match on specific wiki_id for outlink stream (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[14:39:03] <Lucas_WMDE>	 !log Start `foreachwikiindblist 'group2 & s5' DiscussionTools:persistRevisionThreadItems --current --all; touch ~/T315510-s5-exited-$?` in tmux on mwmaint1002 (T315510)
[14:39:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:31] <Lucas_WMDE>	 !log Start `foreachwikiindblist 'group2 & s6' DiscussionTools:persistRevisionThreadItems --current --all; touch ~/T315510-s6-exited-$?` in tmux on mwmaint1002 (T315510)
[14:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:34] <stashbot>	 T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510
[14:40:03] <Lucas_WMDE>	 !log UTC afternoon backport+config window done (maintenance script runs are ongoing and “will probably take a few weeks to complete”)
[14:40:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:12] <MatmaRex>	 thanks Lucas_WMDE
[14:40:14] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: pdns: recursor: make it listen in the right address [puppet] - https://gerrit.wikimedia.org/r/930642 (https://phabricator.wikimedia.org/T307357)
[14:41:01] <wikibugs>	 (CR) Klausman: [C: +1] ml-services: add more experimental settings for LLMs [deployment-charts] - https://gerrit.wikimedia.org/r/930632 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[14:43:11] <Lucas_WMDE>	 o_O cebwiki and frwiki both have about 11 million rows to update apparently
[14:43:31] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339168 (Jclark-ctr) Open→Resolved Replaced Management switch
[14:43:40] <Lucas_WMDE>	 oh wait, this is DiscussionTools, not Flow
[14:43:43] <Lucas_WMDE>	 then it makes sense I guess
[14:44:15] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/930642/41748/" [puppet] - https://gerrit.wikimedia.org/r/930642 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[14:47:29] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: ServiceLVS without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (Clement_Goubert)
[14:48:02] <wikibugs>	 (CR) Jkieserman: [C: +1] Remove GDI survey from RU and JA wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926) (owner: Eigyan)
[14:50:38] <wikibugs>	 (CR) Stevemunene: [C: +2] Use refinery v0.2.16 in refine jobs. [puppet] - https://gerrit.wikimedia.org/r/928525 (https://phabricator.wikimedia.org/T335308) (owner: Snwachukwu)
[14:51:49] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2401.codfw.wmnet with reason: powering off for T326564
[14:51:53] <stashbot>	 T326564: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564
[14:52:02] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2401.codfw.wmnet with reason: powering off for T326564
[14:52:07] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2411.codfw.wmnet with reason: powering off for T326564
[14:52:20] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2411.codfw.wmnet with reason: powering off for T326564
[14:52:34] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2324.codfw.wmnet with reason: powering off for T326564
[14:52:47] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2324.codfw.wmnet with reason: powering off for T326564
[14:52:47] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Traffic: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (hashar)
[14:52:52] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2323.codfw.wmnet with reason: powering off for T326564
[14:53:15] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2323.codfw.wmnet with reason: powering off for T326564
[14:53:48] <claime>	 !log Depooling mw2401 mw2411 mw2324 mw2323 as invalid for powerdown - T326564
[14:53:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:22] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2401.codfw.wmnet
[14:54:39] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2411.codfw.wmnet
[14:54:50] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2324.codfw.wmnet
[14:54:57] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2323.codfw.wmnet
[14:55:18] <claime>	 !log Powering down mw2401 mw2411 mw2324 mw2323 - T326564
[14:55:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:44] <icinga-wm>	 PROBLEM - puppet last run on puppetdb1003 is CRITICAL: CRITICAL: Puppet has been disabled for 604823 seconds, message: testing multi ca support - jbond, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:56:43] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Jhancock.wm)
[14:56:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[14:57:18] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Traffic: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (hashar) a:hashar That is a recurring issue cause the Jenkins jobs are running on static hosts  which are not always entirely cleared up after a...
[14:58:34] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Traffic: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (hashar)
[15:00:34] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wikimediacloud.org: codfw1dev: drop IPv6 records for DNS services [dns] - https://gerrit.wikimedia.org/r/930647 (https://phabricator.wikimedia.org/T307357)
[15:00:36] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wikimediacloud.org: eqiad1: drop IPv6 records for DNS services [dns] - https://gerrit.wikimedia.org/r/930648 (https://phabricator.wikimedia.org/T307357)
[15:00:37] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[15:00:49] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[15:01:29] <wikibugs>	 (CR) CI reject: [V: -1] wikimediacloud.org: codfw1dev: drop IPv6 records for DNS services [dns] - https://gerrit.wikimedia.org/r/930647 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[15:01:46] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Traffic: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (hashar) > The apt cache overflowing, I don't think it is garbage collected  `/srv` is 21G on the instances and:  | Disk size in MB | Directory |--|...
[15:01:51] <claime>	 jouncebot: nowandnext
[15:01:51] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 58 minute(s)
[15:01:51] <jouncebot>	 In 0 hour(s) and 58 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1600)
[15:01:54] <wikibugs>	 (CR) CI reject: [V: -1] wikimediacloud.org: eqiad1: drop IPv6 records for DNS services [dns] - https://gerrit.wikimedia.org/r/930648 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[15:03:27] <wikibugs>	 (CR) Clément Goubert: [C: +2] mediawiki: Gracefully handle termination [deployment-charts] - https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T331609) (owner: Clément Goubert)
[15:04:06] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero)
[15:04:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Clement_Goubert) Open→Resolved Host is back in pool, resolving.
[15:04:30] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Gracefully handle termination [deployment-charts] - https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T331609) (owner: Clément Goubert)
[15:06:13] <wikibugs>	 (CR) Andrew Bogott: [C: +1] dev env: add a basic puppet enc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/928667 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[15:07:57] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[15:09:00] <wikibugs>	 SRE, Continuous-Integration-Infrastructure: Puppet package_builder module should have a cronjob to clear the apt cache - https://phabricator.wikimedia.org/T339251 (hashar)
[15:09:43] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Traffic, ci-test-error: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (hashar) Open→Resolved I have manually deleted the apt caches which were taking half of the disk space and are never purg...
[15:09:46] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:09:59] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:10:00] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:10:10] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:30] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:10:42] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[15:11:04] <claime>	 ^ the above alert is my fault
[15:11:06] <claime>	 fixing
[15:11:16] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:30] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:11:34] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:12:12] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:12:28] <claime>	 !log Deploying new mediawiki chart: Gracefully handle termination - T331609
[15:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:32] <stashbot>	 T331609: Gracefully handle pod termination in mw-on-k8s - https://phabricator.wikimedia.org/T331609
[15:12:36] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[15:13:22] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[15:13:36] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:13:40] <wikibugs>	 (CR) Ssingh: "recheck" [debs/trafficserver] - https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[15:14:34] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[15:14:34] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[15:14:48] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[15:14:58] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[15:16:21] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2401.codfw.wmnet
[15:16:21] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2401.codfw.wmnet
[15:16:41] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2401.codfw.wmnet
[15:16:52] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[15:16:52] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync
[15:17:05] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[15:17:11] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync
[15:18:01] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[15:18:38] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[15:20:11] <wikibugs>	 SRE, Continuous-Integration-Infrastructure: Puppet package_builder module should have a cronjob to clear the apt cache - https://phabricator.wikimedia.org/T339251 (hashar) pbuilder(8) has an option to clean it automatically:    --autocleanaptcache     Clean  apt cache automatically, to run `apt-get autoc...
[15:20:51] <wikibugs>	 (PS1) Hashar: package_builder: autoclean apt cache [puppet] - https://gerrit.wikimedia.org/r/930653 (https://phabricator.wikimedia.org/T339251)
[15:21:16] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[15:21:43] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[15:21:55] <claime>	 !log mw2401.codfw.wmnet repooled following T326564
[15:21:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:59] <stashbot>	 T326564: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564
[15:22:09] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[15:22:16] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:22:35] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/930653 (https://phabricator.wikimedia.org/T339251) (owner: Hashar)
[15:23:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:23:24] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:46] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[15:24:20] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: acme_chief: openstack: codfw1dev: allow cloudservices2004-dev [puppet] - https://gerrit.wikimedia.org/r/930654 (https://phabricator.wikimedia.org/T307357)
[15:24:25] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2411.codfw.wmnet
[15:24:25] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2411.codfw.wmnet
[15:24:48] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[15:25:26] <wikibugs>	 (CR) JHathaway: [C: +2] dev env: add a basic puppet enc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/928667 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[15:26:07] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2411.codfw.wmnet
[15:26:14] <wikibugs>	 (CR) Vgutierrez: [C: -1] "you cannot issue Let's Encrypt certificates for internal domains (.wmnet TLD)" [puppet] - https://gerrit.wikimedia.org/r/930654 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[15:26:57] <wikibugs>	 SRE, Infrastructure-Foundations, Znuny, serviceops-collab: VRTS eqiad replacement - https://phabricator.wikimedia.org/T338418 (Arnoldokoth) @Dzahn Yeah, I created this as a sub-task for that. I will close this first and create another sub-task under (T295416) for decom otrs1001.
[15:27:21] <wikibugs>	 SRE, Infrastructure-Foundations, Znuny, serviceops-collab: upgrade/replace VRTS (formerly ORTS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (Arnoldokoth)
[15:27:24] <wikibugs>	 (CR) Hashar: "PCC https://puppet-compiler.wmflabs.org/output/930653/1965/ gives:" [puppet] - https://gerrit.wikimedia.org/r/930653 (https://phabricator.wikimedia.org/T339251) (owner: Hashar)
[15:27:25] <claime>	 !log mw2411.codfw.wmnet repooled following T326564
[15:27:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:28] <stashbot>	 T326564: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564
[15:27:57] <wikibugs>	 (Abandoned) Arturo Borrero Gonzalez: acme_chief: openstack: codfw1dev: allow cloudservices2004-dev [puppet] - https://gerrit.wikimedia.org/r/930654 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[15:28:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:28:08] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[15:28:14] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Patch-For-Review: Puppet package_builder module should have a cronjob to clear the apt cache - https://phabricator.wikimedia.org/T339251 (hashar) a:hashar
[15:28:21] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[15:28:29] <wikibugs>	 SRE, Infrastructure-Foundations, Znuny, serviceops-collab: Decom otrs1001 - https://phabricator.wikimedia.org/T339253 (Arnoldokoth)
[15:29:11] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul)
[15:30:33] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (elukey) Resolved→Open >>! In T338566#8933725, @elukey wrote: > @Papaul thanks! I confirm that it works :) >  > I think that there is only one thing to do, namely update the document...
[15:30:36] <wikibugs>	 (CR) Hnowlan: [C: +1] cassandra: do not hardcode monitoring ports [puppet] - https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639) (owner: Eevans)
[15:31:37] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[15:33:01] <wikibugs>	 (PS1) Muehlenhoff: ferm: Allow passing the port is a more structured way (WIP) [puppet] - https://gerrit.wikimedia.org/r/930656
[15:33:09] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[15:33:30] <wikibugs>	 (CR) CI reject: [V: -1] ferm: Allow passing the port is a more structured way (WIP) [puppet] - https://gerrit.wikimedia.org/r/930656 (owner: Muehlenhoff)
[15:33:53] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[15:33:57] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: name=mw2324.codfw.wmnet
[15:34:54] <wikibugs>	 (CR) Ssingh: [C: +1] "Looks good and thank you for the patch!" [puppet] - https://gerrit.wikimedia.org/r/930653 (https://phabricator.wikimedia.org/T339251) (owner: Hashar)
[15:35:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Clement_Goubert) >>! In T338566#8935469, @elukey wrote: >>>! In T338566#8933725, @elukey wrote: >> @Papaul thanks! I confirm that it works :) >>  >> I think that there is only one thing to...
[15:36:36] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2324.codfw.wmnet
[15:36:36] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2324.codfw.wmnet
[15:36:45] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2324.codfw.wmnet
[15:37:21] <logmsgbot>	 !log milimetric@deploy1002 Started deploy [analytics/refinery@106bf30]: Patch for HiveToDruid with snapshots
[15:37:24] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul)
[15:37:24] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2324 is CRITICAL: Host mw2324 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:37:36] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] ml-services: add more experimental settings for LLMs [deployment-charts] - https://gerrit.wikimedia.org/r/930632 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[15:37:42] <claime>	 ^that's me, it'll fix itself
[15:38:02] <wikibugs>	 (PS2) Muehlenhoff: ferm: Allow passing the port is a more structured way (WIP) [puppet] - https://gerrit.wikimedia.org/r/930656
[15:38:29] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[15:38:36] <wikibugs>	 SRE, Wikimedia-Site-requests, serviceops, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (Dovi) I concur with [[User:Fuzzy]]; a direct solution to this is needed on Hebrew Wikisource.
[15:38:52] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good, one question" [puppet] - https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906) (owner: Muehlenhoff)
[15:39:02] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2324 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:39:04] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul) Open→Resolved This is complete, thanks to @ssingh and @Clement_Goubert
[15:39:25] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2323.codfw.wmnet
[15:41:14] <wikibugs>	 (CR) Elukey: changeprop: remove match on specific wiki_id for outlink stream (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[15:43:30] <claime>	 !log mw2324.codfw.wmnet repooled following T326564
[15:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:34] <stashbot>	 T326564: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564
[15:44:22] <logmsgbot>	 !log milimetric@deploy1002 Finished deploy [analytics/refinery@106bf30]: Patch for HiveToDruid with snapshots (duration: 07m 01s)
[15:44:42] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2323.codfw.wmnet
[15:44:43] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2323.codfw.wmnet
[15:44:50] <claime>	 !log mw2323.codfw.wmnet repooled following T326564
[15:44:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:49] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: add more experimental settings for LLMs [deployment-charts] - https://gerrit.wikimedia.org/r/930632 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[15:45:57] <logmsgbot>	 !log milimetric@deploy1002 Started deploy [analytics/refinery@106bf30] (thin): Patch for HiveToDruid with snapshots [thin]
[15:46:01] <logmsgbot>	 !log milimetric@deploy1002 Finished deploy [analytics/refinery@106bf30] (thin): Patch for HiveToDruid with snapshots [thin] (duration: 00m 04s)
[15:46:32] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:41] <wikibugs>	 (CR) Muehlenhoff: Provided a dedicated KDC logrotate config and fix service reload (1 comment) [puppet] - https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906) (owner: Muehlenhoff)
[15:51:18] <mutante>	 !log phabricator - made jnuche (https://phabricator.wikimedia.org/people/manage/32076/) an Administrator T339174
[15:51:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:22] <stashbot>	 T339174: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174
[15:51:57] <wikibugs>	 (PS5) JHathaway: apt: ensure profile::apt is applied before packages. [puppet] - https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279)
[15:53:00] <wikibugs>	 SRE, Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (Dzahn) {F37104995}  ^ ;)
[15:53:06] <wikibugs>	 SRE, Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (Dzahn)
[15:53:45] <wikibugs>	 SRE, Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (Dzahn) @Aklapper see logs and screenshot above:) can you click for me? then this is resolved
[15:54:23] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[15:54:35] <wikibugs>	 (CR) Ssingh: "This is ready for review and also running on traffic-cache-bullseye:" [debs/trafficserver] - https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[15:55:20] <icinga-wm>	 PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:56:13] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@c584b62]: (no justification provided)
[15:56:25] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@c584b62]: (no justification provided) (duration: 00m 12s)
[15:57:13] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[15:57:17] <wikibugs>	 (CR) JHathaway: [C: +1] Provided a dedicated KDC logrotate config and fix service reload (1 comment) [puppet] - https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906) (owner: Muehlenhoff)
[15:57:31] <wikibugs>	 (CR) JHathaway: [C: +2] apt: ensure profile::apt is applied before packages. [puppet] - https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) (owner: JHathaway)
[15:58:39] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:58:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:59:15] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: acme_chief: openstack: codfw1dev: refresh LDAP certificates [puppet] - https://gerrit.wikimedia.org/r/930661 (https://phabricator.wikimedia.org/T307357)
[15:59:51] <wikibugs>	 (CR) Vgutierrez: [C: +1] "LGTM, thanks for working on this!" [debs/trafficserver] - https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[16:00:05] <jouncebot>	 jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:06:15] <wikibugs>	 (CR) Vgutierrez: [C: +1] "looks good, take into account that cloudservices2005-dev.wikimedia.org will lose the current certificate as soon as this gets merged and a" [puppet] - https://gerrit.wikimedia.org/r/930661 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[16:06:46] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] acme_chief: openstack: codfw1dev: refresh LDAP certificates [puppet] - https://gerrit.wikimedia.org/r/930661 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez)
[16:10:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:10:32] <vgutierrez>	 sigh...
[16:10:33] <vgutierrez>	 https://letsencrypt.status.io/pages/55957a99e800baa4470002da
[16:10:37] <vgutierrez>	 ^^ arturo 
[16:11:11] <arturo>	 vgutierrez: hopefully I didn't break it :-^
[16:11:42] <vgutierrez>	 nah.. but acme-chief isn't happy with Let's Encrypt being down
[16:11:51] <arturo>	 I can imagine
[16:12:14] <wikibugs>	 (CR) Dzahn: [C: +2] "I closed https://phabricator.wikimedia.org/T337382 optimistically" [puppet] - https://gerrit.wikimedia.org/r/922820 (https://phabricator.wikimedia.org/T337382) (owner: Aklapper)
[16:12:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad appserver GET/404 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede
[16:12:52] <icinga-wm>	 PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:12:58] <icinga-wm>	 PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[16:13:10] <vgutierrez>	 that's kinda expected sadly :)
[16:13:25] <icinga-wm>	 PROBLEM - Check unit status of acme-chief #page on acmechief2001 is CRITICAL: CRITICAL: Status of the systemd unit acme-chief https://wikitech.wikimedia.org/wiki/Acme-chief%23Monitoring
[16:13:52] * Emperor here from the p.age
[16:14:01] <vgutierrez>	 nothing to worry about
[16:14:13] <brett>	 ack
[16:14:18] <Emperor>	 good-oh :)
[16:14:51] <brett>	 acked the page
[16:14:56] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.remove-downtime for acmechief2001.codfw.wmnet
[16:14:56] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2001.codfw.wmnet
[16:15:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:15:10] <vgutierrez>	 LOL... wrong cookbook
[16:15:56] <cdanis>	 vgutierrez: how did you know the exact minute I walked away for lunch
[16:15:58] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on acmechief2001.codfw.wmnet with reason: https://letsencrypt.status.io/pages/55957a99e800baa4470002da
[16:16:00] <cdanis>	 😂
[16:16:11] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2001.codfw.wmnet with reason: https://letsencrypt.status.io/pages/55957a99e800baa4470002da
[16:17:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad appserver GET/404 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee
[16:25:58] <wikibugs>	 SRE, Wikimedia-Site-requests, serviceops, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (Nahum) The Income Tax Ordinance requires a temporary immediate solution while we continue to ponder the best permanent one.
[16:26:34] <wikibugs>	 (CR) Ssingh: [C: +2] Release 9.2.1-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[16:28:23] <wikibugs>	 (PS1) Hnowlan: images: log key limited by poolcounter [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/930664 (https://phabricator.wikimedia.org/T337649)
[16:30:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[16:34:02] <icinga-wm>	 RECOVERY - Check systemd state on vrts2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:30] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339264 (phaultfinder)
[16:37:30] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:38:39] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on acmechief2001.codfw.wmnet with reason: https://letsencrypt.status.io/pages/55957a99e800baa4470002da
[16:38:41] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on acmechief2001.codfw.wmnet with reason: https://letsencrypt.status.io/pages/55957a99e800baa4470002da
[16:38:58] <vgutierrez>	 refreshed the downtime with a 24h one
[16:40:01] <sukhe>	 vgutierrez: ,3
[16:40:02] <sukhe>	 <3
[16:41:27] <wikibugs>	 (CR) Btullis: [C: +1] role::cache::{text,upload}: move vk instances to PKI in eqsin [puppet] - https://gerrit.wikimedia.org/r/930633 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[16:44:24] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:44:31] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero)
[16:44:41] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero) Open→In progress Note: I started to bootstrap the node with instructions from https://wikitech.wikimedia.org/wik...
[16:45:50] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:48:07] <wikibugs>	 SRE, Infrastructure-Foundations, Znuny, serviceops-collab: upgrade/replace VRTS (formerly OTRS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (jeremyb-phone)
[16:48:52] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero) Also, `designate-producer` is complaining about something related to rabbitmq, possibly related to the new IP address:...
[16:50:29] <wikibugs>	 (CR) AikoChou: changeprop: remove match on specific wiki_id for outlink stream (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[16:51:02] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002"
[16:51:24] <icinga-wm>	 RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2001 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[16:52:02] <logmsgbot>	 !log aborrero@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002"
[16:52:03] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices2004-dev.codfw.wmnet with OS bullseye
[16:52:14] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.wmnet...
[16:52:50] <icinga-wm>	 RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:52] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.53.0" for 595 hosts
[16:55:15] <icinga-wm>	 RECOVERY - Check unit status of acme-chief #page on acmechief2001 is OK: OK: Status of the systemd unit acme-chief https://wikitech.wikimedia.org/wiki/Acme-chief%23Monitoring
[16:55:45] <wikibugs>	 (CR) Dzahn: [C: +1] deployment_server: set user.email and user.name in git config [puppet] - https://gerrit.wikimedia.org/r/929400 (https://phabricator.wikimedia.org/T307775) (owner: Chad)
[16:55:48] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.53.0" for 595 hosts
[16:59:07] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.53.0" for 594 hosts
[16:59:34] <wikibugs>	 SRE, Infrastructure-Foundations, puppet-compiler, Continuous-Integration-Config, Release-Engineering-Team (Seen): Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532 (hashar) Open→Resolved a:Legoktm That was implemented by @...
[17:00:04] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.53.0" completed for 594 hosts
[17:00:05] <jouncebot>	 bd808: It is that lovely time of the day again! You are hereby commanded to deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1700)
[17:00:54] <bd808>	 I should have a developer-portal version to deploy today I think... /me looks
[17:01:26] <wikibugs>	 SRE, Infrastructure-Foundations, puppet-compiler, Continuous-Integration-Config, Release-Engineering-Team (Seen): Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532 (hashar) (I think that task was left open to have the list of hosts pa...
[17:02:15] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@bba655e]: (no justification provided)
[17:02:27] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@bba655e]: (no justification provided) (duration: 00m 11s)
[17:04:48] <wikibugs>	 (PS1) BryanDavis: developer-portal: Bump container version to 2023-06-15-114340-production [deployment-charts] - https://gerrit.wikimedia.org/r/930667
[17:06:14] <wikibugs>	 (CR) Krinkle: [C: +1] Update mappings for some countries based on initial Probenet data [dns] - https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: Jameel Kaisar)
[17:06:54] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[17:07:20] <wikibugs>	 SRE, LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (Dzahn) also see T218686 (Create Gerrit Administrator right policy)
[17:07:25] <wikibugs>	 SRE, LDAP-Access-Requests, Release-Engineering-Team (Radar): Grant Access to gerritadmin for Aklapper - https://phabricator.wikimedia.org/T339173 (Dzahn) also see T218686 (Create Gerrit Administrator right policy)
[17:07:43] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Patch-For-Review: Puppet package_builder module should have the apt cache auto cleaned - https://phabricator.wikimedia.org/T339251 (hashar)
[17:07:50] <wikibugs>	 (CR) BryanDavis: [C: +2] developer-portal: Bump container version to 2023-06-15-114340-production [deployment-charts] - https://gerrit.wikimedia.org/r/930667 (owner: BryanDavis)
[17:08:10] <wikibugs>	 SRE, Gerrit, serviceops-collab, Release-Engineering-Team (Seen): Create Gerrit Administrator right policy - https://phabricator.wikimedia.org/T218686 (Dzahn) Priority was set to low. Just came up once again though with linked requests.
[17:08:38] <wikibugs>	 (Merged) jenkins-bot: developer-portal: Bump container version to 2023-06-15-114340-production [deployment-charts] - https://gerrit.wikimedia.org/r/930667 (owner: BryanDavis)
[17:09:29] <wikibugs>	 SRE, LDAP-Access-Requests, Release-Engineering-Team (Radar): Grant Access to gerritadmin for Aklapper - https://phabricator.wikimedia.org/T339173 (Dzahn) +1 to adding Andre, for sure. clinic duty can resolve this like other LDAP group requests.
[17:10:05] <wikibugs>	 SRE, LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (Dzahn) clinic duty can resolve this like other LDAP access requests
[17:10:35] <wikibugs>	 (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/output/929400/41751/deploy1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/929400 (https://phabricator.wikimedia.org/T307775) (owner: Chad)
[17:11:49] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:12:11] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:12:23] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:12:52] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:13:01] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:13:35] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:15:04] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:16:17] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: HelpCompletionTool wasn't added to extension.json [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930541 (https://phabricator.wikimedia.org/T338254)
[17:17:01] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Aklapper) p:05High→03Medium [[ https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities | The Pr...
[17:19:28] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:20:12] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to gerritadmin for Aklapper - https://phabricator.wikimedia.org/T339173 (10hashar) That follows up @Aklapper joining #releng which is owning the #gerrit service. @thcipriani is the team manager thus I guess him filing the task se...
[17:20:34] <wikibugs>	 (03PS1) 10Joal: Move spark_jobs from spark2 to spark3 [puppet] - 10https://gerrit.wikimedia.org/r/930669
[17:21:33] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (10hashar) Gerrit Administrators are managed via the `gerritadmin` LDAP group.  Thank you for filing the task, which is nice for history purposes. I think it is pretty much sel...
[17:22:01] <wikibugs>	 10SRE, 10Gerrit, 10LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (10hashar)
[17:22:07] <wikibugs>	 10SRE, 10Gerrit, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to gerritadmin for Aklapper - https://phabricator.wikimedia.org/T339173 (10hashar)
[17:32:45] <wikibugs>	 (03PS3) 10Hokwelum: Fix up more things in the README for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/928605 (https://phabricator.wikimedia.org/T325232)
[17:32:47] <wikibugs>	 (03PS5) 10Hokwelum: Modify the global blocks script to accept output dir [puppet] - 10https://gerrit.wikimedia.org/r/928861
[17:32:49] <wikibugs>	 (03PS1) 10Hokwelum: make snapshot101[67] temporary testbed hosts [puppet] - 10https://gerrit.wikimedia.org/r/930671
[17:43:33] <wikibugs>	 10SRE, 10Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (10taavi) 05In progress→03Resolved I'm not Andre but done.
[17:45:19] <wikibugs>	 (03PS2) 10Snwachukwu: Move spark_jobs from spark2 to spark3 [puppet] - 10https://gerrit.wikimedia.org/r/930669 (owner: 10Joal)
[17:52:31] <wikibugs>	 (03PS1) 10Ladsgroup: BlockedDomains: Add logging in case of hit [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930542 (https://phabricator.wikimedia.org/T337431)
[17:52:41] <Amir1>	 jouncebot: nowandnext
[17:52:42] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1700)
[17:52:42] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1700)
[17:52:42] <jouncebot>	 In 0 hour(s) and 7 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1800)
[17:53:02] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] BlockedDomains: Add logging in case of hit [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930542 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup)
[17:54:56] <wikibugs>	 (03PS1) 10Ladsgroup: Enable blocked domain list in testwiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930672 (https://phabricator.wikimedia.org/T337431)
[17:57:36] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:58:50] <wikibugs>	 (03PS1) 10Hashar: zuul: replace zuul-gearman.py by gearman-tools [puppet] - 10https://gerrit.wikimedia.org/r/930673 (https://phabricator.wikimedia.org/T339172)
[17:59:04] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:59:10] <wikibugs>	 (03PS2) 10Hashar: zuul: replace zuul-gearman.py by gearman-tools [puppet] - 10https://gerrit.wikimedia.org/r/930673 (https://phabricator.wikimedia.org/T339172)
[18:00:06] <jouncebot>	 jnuche and jeena: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1800). Please do the needful.
[18:01:37] <wikibugs>	 (03PS3) 10Snwachukwu: Move spark_jobs from spark2 to spark3 [puppet] - 10https://gerrit.wikimedia.org/r/930669 (https://phabricator.wikimedia.org/T335308) (owner: 10Joal)
[18:08:40] <wikibugs>	 (03PS1) 10Andrew Bogott: magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586)
[18:08:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[18:09:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586) (owner: 10Andrew Bogott)
[18:09:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] BlockedDomains: Add logging in case of hit [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930542 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup)
[18:10:16] <wikibugs>	 (03PS2) 10Andrew Bogott: magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586)
[18:12:00] <wikibugs>	 (03Merged) 10jenkins-bot: BlockedDomains: Add logging in case of hit [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930542 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup)
[18:13:09] <wikibugs>	 (03PS3) 10Andrew Bogott: magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586)
[18:13:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586) (owner: 10Andrew Bogott)
[18:13:39] <wikibugs>	 (03PS4) 10Andrew Bogott: magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586)
[18:13:56] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:930542|BlockedDomains: Add logging in case of hit (T337431)]]
[18:14:00] <stashbot>	 T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431
[18:14:50] <wikibugs>	 (03CR) 10Andrew Bogott: "The puppet manifest that applies the patch is a nightmare but can you check the diffs to make sure I didn't miss a line and/or reverse the" [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586) (owner: 10Andrew Bogott)
[18:17:34] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:23:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:25:12] <Amir1>	 sigh, I lost connection to deploy1002 
[18:25:58] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:930542|BlockedDomains: Add logging in case of hit (T337431)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[18:26:02] <stashbot>	 T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431
[18:28:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:35:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:36:02] <Amir1>	 claime: 18:35:30 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw2289.codfw.wmnet', 'mw2300.codfw.wmnet', 'mw1398.eqiad.wmnet', 'mw2259.codfw.wmnet', 'mw1420.eqiad.wmnet', 'mw1486.eqiad.wmnet', 'mw1404.eqiad.wmnet', 'mw1366.eqiad.wmnet', 'deploy1002.eqiad.wmnet', 'deploy2002.codfw.wmnet'] (ran as mwdeploy@parse1002.eqiad.wmnet) returned [255]: ssh: connect to host parse1002.eqiad.wmnet port 22: Connection timed out
[18:36:10] <Amir1>	 I honestly think this needs a hw check
[18:36:56] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:37:46] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:37:58] <wikibugs>	 10SRE, 10Gerrit, 10LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (10Dzahn) Nah, it's not self-service for SRE. At least not anymore since a certain incident in the past, when sre was specifically removed from gerritadmins and that's...
[18:38:28] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:40:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:42:50] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:44:18] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:44:29] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:930542|BlockedDomains: Add logging in case of hit (T337431)]] (duration: 30m 33s)
[18:44:33] <stashbot>	 T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431
[18:44:52] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2021.*
[18:45:06] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:45:38] <wikibugs>	 (03PS1) 10Gmodena: mw-page-content-change-enrich: HA in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930675 (https://phabricator.wikimedia.org/T338233)
[18:46:13] <wikibugs>	 (03PS1) 10Andrew Bogott: Heat and Magnum: include service token with subcalls [puppet] - 10https://gerrit.wikimedia.org/r/930676 (https://phabricator.wikimedia.org/T333874)
[18:47:28] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Enable blocked domain list in testwiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930672 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup)
[18:48:10] <ryankemper>	 !log [WDQS] `ryankemper@wdqs2012:~$ sudo pool`
[18:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:18] <wikibugs>	 (03Merged) 10jenkins-bot: Enable blocked domain list in testwiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930672 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup)
[18:48:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930672 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup)
[18:48:32] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:930672|Enable blocked domain list in testwiki and fawiki (T337431)]]
[18:48:42] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:48:44] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Heat and Magnum: include service token with subcalls [puppet] - 10https://gerrit.wikimedia.org/r/930676 (https://phabricator.wikimedia.org/T333874) (owner: 10Andrew Bogott)
[18:49:09] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Update mappings for some countries based on initial Probenet data [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar)
[18:49:32] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:50:07] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:930672|Enable blocked domain list in testwiki and fawiki (T337431)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[18:50:11] <stashbot>	 T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431
[18:51:17] <wikibugs>	 (03PS2) 10Gmodena: mw-page-content-change-enrich: HA in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930675 (https://phabricator.wikimedia.org/T338233)
[18:52:13] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] query_service: migrate WDQS to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/929962 (https://phabricator.wikimedia.org/T264181) (owner: 10Gehel)
[18:53:01] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] query_service: migrate WDQS to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/929962 (https://phabricator.wikimedia.org/T264181) (owner: 10Gehel)
[18:53:21] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "multiple issues according to PCC, I'll check back once the parent CR is merged." [puppet] - 10https://gerrit.wikimedia.org/r/930191 (owner: 10Gehel)
[18:55:08] <wikibugs>	 (03PS3) 10Gmodena: mw-page-content-change-enrich: HA in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930675 (https://phabricator.wikimedia.org/T338233)
[18:56:28] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:56:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[18:57:56] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:00:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:02:33] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] cassandra: do not hardcode monitoring ports [puppet] - 10https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639) (owner: 10Eevans)
[19:03:00] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:05:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:05:46] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:06:13] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:930672|Enable blocked domain list in testwiki and fawiki (T337431)]] (duration: 17m 40s)
[19:06:17] <stashbot>	 T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431
[19:08:56] <wikibugs>	 (03PS1) 10Milimetric: Revert "Revert "Bump mediawiki_history_reduced version for aqs"" [puppet] - 10https://gerrit.wikimedia.org/r/930543
[19:09:08] <wikibugs>	 (03PS2) 10Milimetric: Revert "Revert "Bump mediawiki_history_reduced version for aqs"" [puppet] - 10https://gerrit.wikimedia.org/r/930543
[19:13:22] <wikibugs>	 (03PS1) 10Gehel: query_service: fix logging configuration for wdqs updater [puppet] - 10https://gerrit.wikimedia.org/r/930678
[19:15:58] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:16:45] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] query_service: fix logging configuration for wdqs updater [puppet] - 10https://gerrit.wikimedia.org/r/930678 (owner: 10Gehel)
[19:16:47] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] query_service: fix logging configuration for wdqs updater [puppet] - 10https://gerrit.wikimedia.org/r/930678 (owner: 10Gehel)
[19:19:21] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Jhancock.wm)
[19:20:24] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:23:30] <wikibugs>	 (03PS2) 10Btullis: Update the mediawiki_history_reduced snapshot to AQS [puppet] - 10https://gerrit.wikimedia.org/r/930620
[19:25:22] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the mediawiki_history_reduced snapshot to AQS [puppet] - 10https://gerrit.wikimedia.org/r/930620 (owner: 10Btullis)
[19:25:40] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:28:38] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:31:31] <wikibugs>	 (03CR) 10EllenR: [C: 03+1] "you got a +2, but I'll add my 2 cents" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926) (owner: 10Eigyan)
[19:32:03] <wikibugs>	 (03PS1) 10Andrew Bogott: neutron policy: policy rules to permit members to create magnum clusters [puppet] - 10https://gerrit.wikimedia.org/r/930681 (https://phabricator.wikimedia.org/T333874)
[19:33:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] neutron policy: policy rules to permit members to create magnum clusters [puppet] - 10https://gerrit.wikimedia.org/r/930681 (https://phabricator.wikimedia.org/T333874) (owner: 10Andrew Bogott)
[19:38:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[19:39:07] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339264 (10Jclark-ctr) a:03Jclark-ctr
[19:40:41] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:41:51] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:45:48] <wikibugs>	 (03PS1) 10Superpes15: [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826)
[19:46:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) (owner: 10Superpes15)
[19:47:18] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T339178 (10Jhancock.wm) tested connection. can ssh into the management port. resolve.
[19:47:57] <wikibugs>	 (03PS2) 10Superpes15: [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826)
[19:48:47] <wikibugs>	 (03PS3) 10Superpes15: [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826)
[19:49:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) (owner: 10Superpes15)
[19:52:07] <Superpes>	 Uhm "Unexpected ';', expecting ']' in ./wmf-config/core-Permissions.php on line 5536"
[19:53:33] <TheresNoTime>	 oops
[19:53:34] <TheresNoTime>	 :p
[19:53:57] <taavi>	 you seem to be accidentally removing the closing bracket for the `eliminator` group
[19:54:35] <Superpes>	 Uhm do you mean in line 4263?
[19:54:51] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:55:27] <Superpes>	 I just fixed the indentation...
[19:55:29] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:56:25] <taavi>	 no, 2583
[19:56:42] <Superpes>	 Oh
[19:57:43] <wikibugs>	 (03PS4) 10Superpes15: [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826)
[19:58:14] <eigyan>	 Greetings All!
[19:58:48] <wikibugs>	 10SRE, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Wikipedia-iOS-App-Backlog: Integrate In-App Internet censorship circumvention by domain fronting - https://phabricator.wikimedia.org/T327286 (10ZauberViolino) Is the Wikipedia app available on Apple's App Store? (My iPad region is US so I cannot check...
[19:59:28] <Superpes>	 Lol, fixed. Thanks taavi, I didn't see it at all :D
[19:59:43] <wikibugs>	 (03Abandoned) 10Milimetric: Revert "Revert "Bump mediawiki_history_reduced version for aqs"" [puppet] - 10https://gerrit.wikimedia.org/r/930543 (owner: 10Milimetric)
[19:59:46] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:00:07] <jouncebot>	 brennen and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T2000).
[20:00:07] <jouncebot>	 eigyan, MatmaRex, and Superpes: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:23] <MatmaRex>	 hi
[20:00:34] <eigyan>	 o/
[20:00:59] * TheresNoTime looks for brennen 
[20:01:37] <thcipriani>	 I can do this if you need someone to fill in TheresNoTime 
[20:01:54] <TheresNoTime>	 thcipriani: if you wouldn't mind, thank you :)
[20:02:01] <thcipriani>	 no problem, on it
[20:02:29] <wikibugs>	 (03PS2) 10Thcipriani: Remove GDI survey from RU and JA wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926) (owner: 10Eigyan)
[20:02:57] <thcipriani>	 eigyan: I'll start with yours
[20:03:10] <eigyan>	 Many thanks thcipriani
[20:03:24] * thcipriani fumbles in window manager
[20:04:23] <MatmaRex>	 i think i'll need to add another change to the window, i'm preparing a revert
[20:04:40] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:05:19] <thcipriani>	 MatmaRex: k
[20:06:00] <thcipriani>	 checking into some logspam real quick before starting, sorry for delay
[20:08:47] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Revert "Targets: Use align:'after' instead of actionGroups" [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930545 (https://phabricator.wikimedia.org/T339292)
[20:09:03] <Superpes>	 Uhm Can't test my patch anymore
[20:09:23] <Superpes>	 If someone can.. otherwise I should schedule it next week!
[20:10:31] <MatmaRex>	 Superpes: what do you mean?
[20:10:54] <icinga-wm>	 RECOVERY - Host ps1-c6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms
[20:11:53] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] mw-page-content-change-enrich: HA in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930675 (https://phabricator.wikimedia.org/T338233) (owner: 10Gmodena)
[20:12:00] <thcipriani>	 ok, going ahead, sorry for getting distracted by errors :P
[20:12:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926) (owner: 10Eigyan)
[20:13:25] <wikibugs>	 (03Merged) 10jenkins-bot: Remove GDI survey from RU and JA wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926) (owner: 10Eigyan)
[20:13:26] <MatmaRex>	 Superpes: the change looks straightforward to me, i think i can verify it once deployed :)
[20:13:40] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:930639|Remove GDI survey from RU and JA wikis. (T338926)]]
[20:13:44] <stashbot>	 T338926: Undeploy Community Safety Survey from RU and JA Wikipedias (est. on or after June 14th) - https://phabricator.wikimedia.org/T338926
[20:14:18] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 03+1] [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) (owner: 10Superpes15)
[20:15:03] <wikibugs>	 (03PS1) 10Andrew Bogott: neutron policy: more policy rule changes to support our shared network [puppet] - 10https://gerrit.wikimedia.org/r/930683 (https://phabricator.wikimedia.org/T333874)
[20:15:13] <logmsgbot>	 !log thcipriani@deploy1002 essexigyan and thcipriani: Backport for [[gerrit:930639|Remove GDI survey from RU and JA wikis. (T338926)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:15:57] <thcipriani>	 ^ eigyan on mwdebug, check please
[20:16:13] <eigyan>	 Excellent checking now
[20:16:18] <eigyan>	 thank you
[20:17:44] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339264 (10Jclark-ctr) Rebooted Msw
[20:18:30] <eigyan>	 All is well thcipriani, thank you for all you do!
[20:18:43] <thcipriani>	 great! going live everywhere now
[20:18:51] <eigyan>	 Suhweeet!
[20:20:54] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] neutron policy: more policy rule changes to support our shared network [puppet] - 10https://gerrit.wikimedia.org/r/930683 (https://phabricator.wikimedia.org/T333874) (owner: 10Andrew Bogott)
[20:22:41] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Move spark_jobs from spark2 to spark3 [puppet] - 10https://gerrit.wikimedia.org/r/930669 (https://phabricator.wikimedia.org/T335308) (owner: 10Joal)
[20:23:39] <thcipriani>	 > ssh: connect to host parse1002.eqiad.wmnet port 22: Connection timed out
[20:23:41] <thcipriani>	 hrmmmm
[20:23:55] <thcipriani>	 is that known? /me checks sal
[20:25:21] <thcipriani>	 doesn't look like anything is happening with it that's been logged
[20:25:38] <wikibugs>	 (03PS1) 10Ottomata: Remove reference to absent ::druid_load classes [puppet] - 10https://gerrit.wikimedia.org/r/930684 (https://phabricator.wikimedia.org/T335308)
[20:26:09] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Remove reference to absent ::druid_load classes [puppet] - 10https://gerrit.wikimedia.org/r/930684 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata)
[20:26:34] <jeena>	 thcipriani: we failed to install scap to that this morning as well
[20:27:10] <jeena>	 jaime said it happens sometimes and the changes weren't relevant to it anyway
[20:27:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149']
[20:27:53] <thcipriani>	 jeena: oh, thanks for the note. That'd be nice to fix. Have to wait for timeouts :(
[20:30:10] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:930639|Remove GDI survey from RU and JA wikis. (T338926)]] (duration: 16m 30s)
[20:30:14] <stashbot>	 T338926: Undeploy Community Safety Survey from RU and JA Wikipedias (est. on or after June 14th) - https://phabricator.wikimedia.org/T338926
[20:30:22] <thcipriani>	 and, yeah, see it got downtimed earlier today for the same reason
[20:30:31] <thcipriani>	 ^ eigyan should be live everywhere
[20:30:56] <eigyan>	 I'll have a look thcipriani
[20:31:08] <thcipriani>	 thanks
[20:31:15] <thcipriani>	 MatmaRex: you're up
[20:31:23] <MatmaRex>	 yup
[20:31:31] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] HelpCompletionTool wasn't added to extension.json [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930541 (https://phabricator.wikimedia.org/T338254) (owner: 10Bartosz Dziewoński)
[20:31:37] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Revert "Targets: Use align:'after' instead of actionGroups" [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930545 (https://phabricator.wikimedia.org/T339292) (owner: 10Bartosz Dziewoński)
[20:31:49] <thcipriani>	 and sorry I should have been backporting these the whole time
[20:31:58] <thcipriani>	 er...should have +2'd them
[20:32:02] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1149']
[20:32:27] <thcipriani>	 I'll jump ahead to Superpes while we wait for jenkins
[20:33:21] <thcipriani>	 Superpes: are you ready for deploy for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/930682/ ?
[20:36:26] <MatmaRex>	 thcipriani: i think they said they had to leave, but i can verify that change
[20:36:43] <thcipriani>	 oh, ok, thanks MatmaRex going ahead
[20:37:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) (owner: 10Superpes15)
[20:38:25] <wikibugs>	 (03Merged) 10jenkins-bot: [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) (owner: 10Superpes15)
[20:38:40] <eigyan>	 Thank you thcipriani all is well signing off for now...
[20:38:41] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:930682|[uzwiki] Add the 'patroller' usergroup (T338826)]]
[20:38:46] <stashbot>	 T338826: Request to activate patroller user group on uzwiki - https://phabricator.wikimedia.org/T338826
[20:40:03] <logmsgbot>	 !log thcipriani@deploy1002 superpes and thcipriani: Backport for [[gerrit:930682|[uzwiki] Add the 'patroller' usergroup (T338826)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:40:56] <thcipriani>	 ^ MatmaRex on mwdebug, check please
[20:41:04] <thcipriani>	 eigyan: thank you, see ya
[20:42:05] <MatmaRex>	 thcipriani: looks good, i see the group at https://uz.wikipedia.org/wiki/Maxsus:ListGroupRights as expected
[20:42:33] <thcipriani>	 cool, thank you for volunteering as tribute, going live
[20:42:39] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149']
[20:42:47] <wikibugs>	 (03PS1) 10Ottomata: refine - Use trailing / for schema base uris [puppet] - 10https://gerrit.wikimedia.org/r/930706 (https://phabricator.wikimedia.org/T335308)
[20:45:20] <MatmaRex>	 heh
[20:46:39] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] refine - Use trailing / for schema base uris [puppet] - 10https://gerrit.wikimedia.org/r/930706 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata)
[20:48:08] <wikibugs>	 (03PS1) 10Ottomata: refine_test - Use trailing / for schema base uris [puppet] - 10https://gerrit.wikimedia.org/r/930708 (https://phabricator.wikimedia.org/T335308)
[20:50:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1149']
[20:51:11] * thcipriani waits on timeouts for parse1002...
[20:52:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150']
[20:53:40] <wikibugs>	 (03Merged) 10jenkins-bot: HelpCompletionTool wasn't added to extension.json [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930541 (https://phabricator.wikimedia.org/T338254) (owner: 10Bartosz Dziewoński)
[20:53:43] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Targets: Use align:'after' instead of actionGroups" [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930545 (https://phabricator.wikimedia.org/T339292) (owner: 10Bartosz Dziewoński)
[20:54:09] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:930682|[uzwiki] Add the 'patroller' usergroup (T338826)]] (duration: 15m 27s)
[20:54:12] <stashbot>	 T338826: Request to activate patroller user group on uzwiki - https://phabricator.wikimedia.org/T338826
[20:54:33] <thcipriani>	 ^ MatmaRex Superpes should be live everywhere now
[20:55:17] <MatmaRex>	 thanks
[20:55:25] <thcipriani>	 MatmaRex: any harm deploying both of these at the same time?
[20:55:30] <wikibugs>	 (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/929749 (https://phabricator.wikimedia.org/T335610) (owner: 10Cwhite)
[20:55:36] <MatmaRex>	 thcipriani: no, that should be okay
[20:55:52] <thcipriani>	 cool, I'll do that
[20:57:21] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:930545|Revert "Targets: Use align:'after' instead of actionGroups" (T339292)]], [[gerrit:930541|HelpCompletionTool wasn't added to extension.json (T338254)]]
[20:57:25] <stashbot>	 T339292: Issues with gadgets adding tools to VisualEditor "Page options" dropdown (ve.init.Target.actionGroups[1] is undefined) - https://phabricator.wikimedia.org/T339292
[20:57:26] <stashbot>	 T338254: Expose toolbar search feature in toolbar itself - https://phabricator.wikimedia.org/T338254
[20:58:45] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and matmarex: Backport for [[gerrit:930545|Revert "Targets: Use align:'after' instead of actionGroups" (T339292)]], [[gerrit:930541|HelpCompletionTool wasn't added to extension.json (T338254)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:59:00] <thcipriani>	 ^ should be on mwdebug, check please
[20:59:09] <MatmaRex>	 looking
[20:59:42] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:01:07] <wikibugs>	 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10Eevans)
[21:01:08] <MatmaRex>	 thcipriani: both changes look good
[21:01:14] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:01:24] <thcipriani>	 MatmaRex: okie doke, going live everywhere
[21:01:58] <urandom>	 Decommission cassandra-a, cassandra-dev2001 — T339304
[21:01:58] <stashbot>	 T339304: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304
[21:02:14] <wikibugs>	 (03Abandoned) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[21:03:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm)
[21:08:29] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host cassandra-dev2001.codfw.wmnet with OS bullseye
[21:08:36] <wikibugs>	 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host cassandra-dev2001.codfw.wmnet with OS bullseye
[21:11:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150']
[21:11:51] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150']
[21:12:03] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1150']
[21:12:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150']
[21:12:48] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1150']
[21:13:30] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:930545|Revert "Targets: Use align:'after' instead of actionGroups" (T339292)]], [[gerrit:930541|HelpCompletionTool wasn't added to extension.json (T338254)]] (duration: 16m 09s)
[21:13:35] <stashbot>	 T339292: Issues with gadgets adding tools to VisualEditor "Page options" dropdown (ve.init.Target.actionGroups[1] is undefined) - https://phabricator.wikimedia.org/T339292
[21:13:35] <stashbot>	 T338254: Expose toolbar search feature in toolbar itself - https://phabricator.wikimedia.org/T338254
[21:13:41] <thcipriani>	 ^ alright MatmaRex all done
[21:13:48] <MatmaRex>	 thanks thcipriani
[21:13:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150']
[21:14:02] <thcipriani>	 !log parse1002 having ssh connection problems during backport window
[21:14:03] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1150']
[21:14:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:12] <thcipriani>	 thanks for all the checking MatmaRex o/
[21:17:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150']
[21:17:42] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150']
[21:19:01] <logmsgbot>	 !log jhancock@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150']
[21:19:10] <logmsgbot>	 !log jhancock@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150']
[21:21:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150']
[21:21:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150']
[21:24:16] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage
[21:26:42] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage
[21:28:24] <Superpes>	 Thanks thcipriani and MatmaRex :)
[21:28:34] <wikibugs>	 (03PS1) 10Phedenskog: Remove oversampling for Navigation Timing extension. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930712 (https://phabricator.wikimedia.org/T337858)
[21:28:52] <Superpes>	 Unfortunately I had a sudden commitment!
[21:29:05] <thcipriani>	 Superpes: it happens, thanks for the patch
[21:29:57] <wikibugs>	 (03PS1) 10Ottomata: refine & spark_job - parameterize spark_submit executable path and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308)
[21:30:15] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] refine_test - Use trailing / for schema base uris [puppet] - 10https://gerrit.wikimedia.org/r/930708 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata)
[21:30:32] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1151']
[21:30:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] refine & spark_job - parameterize spark_submit executable path and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata)
[21:31:14] <wikibugs>	 (03PS2) 10Ottomata: refine & spark_job - parameterize spark_submit path and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308)
[21:31:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] refine & spark_job - parameterize spark_submit path and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata)
[21:34:33] <wikibugs>	 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Investigate missing WikibaseQualityConstraints logs in logstash. - https://phabricator.wikimedia.org/T214031 (10colewhite) Might be related to how MediaWiki logging is configured?  Some messages get through like jobrunner and some messa...
[21:34:43] <wikibugs>	 (03PS3) 10Ottomata: refine - parameterize spark_submit and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308)
[21:35:37] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41757/console" [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata)
[21:36:58] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1 C: 03+2] refine - parameterize spark_submit and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata)
[21:39:47] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "LGTM. Can be deployed any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930712 (https://phabricator.wikimedia.org/T337858) (owner: 10Phedenskog)
[21:40:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1151']
[21:40:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1152']
[21:41:53] <wikibugs>	 (03PS1) 10Ottomata: refine_sanitize - Fix typo in spark_submit path [puppet] - 10https://gerrit.wikimedia.org/r/930714 (https://phabricator.wikimedia.org/T335308)
[21:42:05] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] refine_sanitize - Fix typo in spark_submit path [puppet] - 10https://gerrit.wikimedia.org/r/930714 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata)
[21:50:30] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1152']
[21:59:47] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2001.codfw.wmnet with OS bullseye
[21:59:52] <wikibugs>	 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host cassandra-dev2001.codfw.wmnet with OS bullseye completed: - cassan...
[22:01:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1153']
[22:08:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[22:12:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1153']
[22:13:13] <wikibugs>	 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10Eevans) p:05Triage→03Medium
[22:14:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1154']
[22:14:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet
[22:14:07] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[22:14:28] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS bullseye
[22:14:36] <wikibugs>	 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host cassandra-dev2002.codfw.wmnet with OS bullseye
[22:17:34] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:18:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002"
[22:21:09] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002"
[22:21:09] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:28:08] <wikibugs>	 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10dancy)
[22:28:24] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1154']
[22:28:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1155']
[22:30:16] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[22:32:21] <wikibugs>	 (03PS1) 10Cwhite: backport orchestrator fields from ECS 8.8 [software/ecs] - 10https://gerrit.wikimedia.org/r/930597 (https://phabricator.wikimedia.org/T292881)
[22:33:15] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[22:38:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1155']
[22:38:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1156']
[22:40:56] <wikibugs>	 (03PS1) 10EoghanGaffney: registry: Add nginx logs to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/930719 (https://phabricator.wikimedia.org/T322579)
[22:43:28] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41758/console" [puppet] - 10https://gerrit.wikimedia.org/r/930719 (https://phabricator.wikimedia.org/T322579) (owner: 10EoghanGaffney)
[22:46:19] <wikibugs>	 (03PS1) 10Cathal Mooney: Modify Juniper ZTP shell script to use ed25519 keyword [puppet] - 10https://gerrit.wikimedia.org/r/930720 (https://phabricator.wikimedia.org/T336485)
[22:49:36] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1156']
[22:49:42] <wikibugs>	 (03CR) 10Papaul: [V: 03+1] Modify Juniper ZTP shell script to use ed25519 keyword [puppet] - 10https://gerrit.wikimedia.org/r/930720 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[22:50:09] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Modify Juniper ZTP shell script to use ed25519 keyword [puppet] - 10https://gerrit.wikimedia.org/r/930720 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney)
[22:52:03] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1156']
[22:53:34] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1156']
[22:54:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1156']
[22:56:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[22:57:28] <wikibugs>	 (03PS1) 10Papaul: Add an-worker11[49-56] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/930724 (https://phabricator.wikimedia.org/T327295)
[23:00:13] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T339100 (10wiki_willy) a:03Jclark-ctr
[23:01:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Relabel: puppetserver1005 to puppetserver1001 - https://phabricator.wikimedia.org/T338326 (10wiki_willy) a:03Jclark-ctr
[23:02:06] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1156']
[23:02:50] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1155']
[23:03:19] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add an-worker11[49-56] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/930724 (https://phabricator.wikimedia.org/T327295) (owner: 10Papaul)
[23:07:16] <wikibugs>	 (03PS1) 10Cathal Mooney: Allow MGMT ranges to make TFTP requests to install server [puppet] - 10https://gerrit.wikimedia.org/r/930727 (https://phabricator.wikimedia.org/T336485)
[23:08:49] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339264 (10Jclark-ctr) 05Open→03Resolved link restored on servers in C6
[23:09:33] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1155']
[23:10:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1154']
[23:10:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1154']
[23:12:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1153']
[23:16:02] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2002.codfw.wmnet with OS bullseye
[23:16:09] <wikibugs>	 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host cassandra-dev2002.codfw.wmnet with OS bullseye completed: - cassan...
[23:20:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1153']
[23:20:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1152']
[23:20:44] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1152']
[23:21:08] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1151']
[23:21:18] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1151']
[23:23:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1154']
[23:24:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1154']
[23:26:22] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-a1-codfw.mgmt.codfw.wmnet
[23:30:30] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1154']
[23:30:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1154']
[23:31:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150']
[23:31:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150']
[23:37:00] <wikibugs>	 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (10Papaul)
[23:37:22] <wikibugs>	 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (10Papaul) p:05Triage→03Medium
[23:38:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[23:39:14] <wikibugs>	 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10Eevans)
[23:42:28] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host cassandra-dev2003.codfw.wmnet with OS bullseye
[23:42:34] <wikibugs>	 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host cassandra-dev2003.codfw.wmnet with OS bullseye
[23:42:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150']
[23:43:05] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150']
[23:43:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1153']
[23:44:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1153']
[23:44:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1153']
[23:44:58] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['an-worker1153']
[23:45:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-a1-codfw.mgmt.codfw.wmnet
[23:45:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[23:46:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150']
[23:46:23] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150']
[23:47:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:47:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a1-codfw.mgmt.codfw.wmnet
[23:51:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-a2-codfw.mgmt.codfw.wmnet
[23:51:49] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[23:52:27] <wikibugs>	 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (10Papaul)
[23:54:33] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a2-codfw - pt1979@cumin2002"
[23:55:20] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:55:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a2-codfw - pt1979@cumin2002"
[23:55:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:55:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149']
[23:56:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a2-codfw.mgmt.codfw.wmnet
[23:56:38] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:57:24] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1149']
[23:58:39] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage