[00:00:54] <subbu>	 It is Template:Wikidata Infobox/core with lots of calls to Module:WikidataIB ... but haven't found any edits to any of those. Info is probably there in changeprop / jobqueue jobs somewhere .. need a way to log changes that trigger a large volume of reparse events if it isn't already there somewhere.
[00:02:01] <Krinkle>	 subbu: every job logs a reqId in logstash, which should in theory identify the original edit that started the chain
[00:02:10] <Krinkle>	 since we preserve and pass that on
[00:02:35] <subbu>	 I see .. good to know. I need to learn about this sleuthing.
[00:02:36] <wikibugs>	 (PS1) Cwhite: logstash: add loki output support [puppet] - https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826)
[00:02:48] <Krinkle>	 https://logstash.wikimedia.org/goto/c3de970e794549bd76a74fc766cc7254
[00:03:02] <Krinkle>	 I clicked on a random one of those reqId fields
[00:03:14] <Krinkle>	 250,000 entries matching that reqId
[00:04:16] <Krinkle>	 I'm confused as to how we have a job that triggers parsoid api requests though
[00:04:27] <Krinkle>	 what kind of a job does that?
[00:04:36] <subbu>	 [Template:Wikidata Infobox/i18n/en] ?
[00:04:45] <Krinkle>	 TranslateRenderJob
[00:04:52] <Krinkle>	 Indeed, that's the first log message
[00:06:51] <subbu>	 as for parsoid reparses ... the chain is: changeprop -> restbase -> parsoid.
[00:06:53] <Krinkle>	 okay, the full story is at https://logstash.wikimedia.org/goto/7e6fae652e7aa6bfa71099f1db98ee6f
[00:07:15] <Krinkle>	 the slow-log dashboard isn't a good one to try and repurpose into seeing all messages, as some of its panels have unmodifiable filters
[00:07:27] <Krinkle>	 moved the reqId search to the general mediawiki dashboard instead
[00:08:08] <Krinkle>	 it started with /w/index.php?title=Translations:Template:Wikidata_Infobox/i18n/msg-search-depicted/en&action=submit
[00:10:15] <Krinkle>	 Jun 28, 2022 @ 18:58
[00:10:23] <subbu>	 oye! one translation updated and the whole firestorm started.
[00:10:57] <Krinkle>	 this edit then queued a job which ran to completion on a jobrunner (18:58:01) `TranslateRenderJob [Template:Wikidata Infobox/i18n/en]: Finished TranslateRenderJob`
[00:11:38] <Krinkle>	 Then six seconds later we see a storm of api.php requests for reasons unclear to me, many of which log errors like `Pool key 'commonswiki:pcache:idhash: .. ' (ArticleView): Usage error: You may only aquire a single non-nowait lock.`
[00:13:31] <Krinkle>	 an hour later we see one entry from a parsoid server: `/w/rest.php/commons.wikimedia.org/v3/page/pagebundle/Koch_snowflake/619632125`
[00:13:40] <Krinkle>	 again, same reqId chain still.
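The reqId sleuthing described above can be sketched offline as follows. This is a minimal illustration, assuming log entries have already been exported as dictionaries; the field names (`reqId`, `timestamp`, `message`), the reqId values, and the exact timestamps are hypothetical and do not reflect the real Logstash schema or query API.

```python
from collections import Counter
from datetime import datetime

# Illustrative records only: field names, reqId values and timestamps are
# assumptions made for this sketch, loosely modelled on the events above.
SAMPLE = [
    {"reqId": "req-A", "timestamp": "2022-06-28T19:58:00", "message": "parsoid pagebundle request"},
    {"reqId": "req-A", "timestamp": "2022-06-28T18:58:01", "message": "Finished TranslateRenderJob"},
    {"reqId": "req-B", "timestamp": "2022-06-28T18:59:00", "message": "unrelated request"},
    {"reqId": "req-A", "timestamp": "2022-06-28T18:58:00", "message": "edit submitted"},
    {"reqId": "req-A", "timestamp": "2022-06-28T18:58:07", "message": "api.php request"},
]


def trace_chain(records, req_id):
    """Return every record sharing req_id, oldest first.

    The first element is the best candidate for the event that started
    the chain, since the reqId is passed on to everything it triggers.
    """
    matching = [r for r in records if r.get("reqId") == req_id]
    return sorted(matching, key=lambda r: datetime.fromisoformat(r["timestamp"]))


def largest_chains(records, top=5):
    """Count records per reqId and return the busiest chains first."""
    counts = Counter(r["reqId"] for r in records if "reqId" in r)
    return counts.most_common(top)


if __name__ == "__main__":
    for req_id, size in largest_chains(SAMPLE):
        origin = trace_chain(SAMPLE, req_id)[0]
        print(f"{req_id}: {size} entries, started with: {origin['message']}")
```

Sorting by timestamp within one reqId mirrors what was done by hand here: pick a reqId, list everything that carries it, and read the oldest entry to find the originating edit. Counting entries per reqId is one possible way to surface changes that fan out into a large volume of follow-up events.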
[00:13:49] <icinga-wm>	 PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service,man-db.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2157.codfw.wmnet with reason: host reimage
[00:15:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:52] <subbu>	 my brain is fried here ... i'll have to step away for a bit and will look back here to see what you all find and if there is any action needed on our end.
[00:18:47] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2157.codfw.wmnet with reason: host reimage
[00:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:02] <Krinkle>	 I'm guessing changeprop isn't just pregenerating a parsoid result in restbase for every edit, but also propagating template edits in its own custom way based on template links information from somewhere, and presumably not in a way that honours the ten years of optimisations we applied to refreshlinks in MW core, nor anything else jobqueue related.
[00:20:19] <Krinkle>	 One issue at least that seems in need of investigating further is that these API requests (which I'm guessing start independently from changeprop; I wasn't aware that changeprop knew the edit reqId and re-used it the same way as our jobrunner, that's useful actually, I'm curious where it gets that reqId from though given mw core doesn't trigger that afaik) are managing to very often trigger a poolcounter 
[00:20:19] <Krinkle>	 error, that shouldn't happen under normal circumstances.
[00:22:07] <Krinkle>	 those API requests are also, again for unknown reasons, initiating MW sessions, and logging things like `Failed to fetch commonswiki:MWSession:…… : (503) `
[00:22:28] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:23:14] <icinga-wm>	 RECOVERY - Check systemd state on es2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:54] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[00:32:02] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2157.codfw.wmnet with OS bullseye
[00:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:10] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2157.codfw.wmnet with OS bullseye completed: - db2...
[00:32:45] <wikibugs>	 (PS9) DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015)
[00:33:13] <ebernhardson>	 checking into cirrus failures alert
[00:37:53] <wikibugs>	 (PS10) DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015)
[00:45:21] <wikibugs>	 (CR) DDesouza: "Reduced coverage to exercise caution because we will not be able to take it down during the weekend and we want to disable the survey as s" [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[00:49:53] <ebernhardson>	 !log T310924 Cleared eqiad chi->omega cross cluster settings and reapplied
[00:49:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:59] <stashbot>	 T310924: Investigate CirrusSearch eqiad failures - https://phabricator.wikimedia.org/T310924
[00:57:18] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[00:58:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2158.codfw.wmnet with OS bullseye
[00:58:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:11] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2158.codfw.wmnet with OS bullseye
[00:58:55] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2155.codfw.wmnet with OS bullseye
[00:58:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:00] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye executed with er...
[01:07:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2159.codfw.wmnet with OS bullseye
[01:07:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:36] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2159.codfw.wmnet with OS bullseye
[01:17:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2158.codfw.wmnet with reason: host reimage
[01:17:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:05] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Papaul)
[01:20:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2158.codfw.wmnet with reason: host reimage
[01:20:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:34] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:27:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2159.codfw.wmnet with reason: host reimage
[01:27:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2159.codfw.wmnet with reason: host reimage
[01:32:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:43] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2158.codfw.wmnet with OS bullseye
[01:36:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:48] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2158.codfw.wmnet with OS bullseye completed: - db2...
[01:48:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2159.codfw.wmnet with OS bullseye
[01:48:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:57] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2159.codfw.wmnet with OS bullseye completed: - db2...
[01:59:57] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:57] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[02:11:00] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2160.codfw.wmnet with OS bullseye
[02:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:07] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2160.codfw.wmnet with OS bullseye
[02:11:47] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Papaul)
[02:14:36] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:15:16] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[02:15:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:59] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:17:19] <logmsgbot>	 !log bmansurov@deploy1002 deploy aborted: (no justification provided) (duration: 02m 03s)
[02:17:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:34] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[02:18:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:43] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[02:18:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:55] <wikibugs>	 (PS2) KartikMistry: Enable Wikistories on idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: Sbisson)
[02:47:53] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[02:47:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:47:57] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 03s)
[02:47:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:48:40] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[02:48:42] <logmsgbot>	 !log bmansurov@deploy1002 deploy aborted: (no justification provided) (duration: 00m 02s)
[02:48:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:48:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:48:53] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[02:48:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:01] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[02:49:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:56] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[02:50:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:50:04] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[02:50:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:17] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2160.codfw.wmnet with OS bullseye
[02:59:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:23] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2160.codfw.wmnet with OS bullseye executed with er...
[03:18:23] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:32:03] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:59] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:41:41] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:27] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:07] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:57] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:05:29] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:17:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Primary switchover x1 T300472
[05:17:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:09] <stashbot>	 T300472: Switchover x1 master (db1103 -> db1120) - https://phabricator.wikimedia.org/T300472
[05:17:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Primary switchover x1 T300472
[05:17:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1120 with weight 0 T300472', diff saved to https://phabricator.wikimedia.org/P30632 and previous config saved to /var/cache/conftool/dbconfig/20220630-051730-root.json
[05:17:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:11] <wikibugs>	 (PS2) Marostegui: mariadb: Promote db1120 to x1 master [puppet] - https://gerrit.wikimedia.org/r/809607 (https://phabricator.wikimedia.org/T300472)
[05:25:13] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1120 to x1 master [puppet] - https://gerrit.wikimedia.org/r/809607 (https://phabricator.wikimedia.org/T300472) (owner: Marostegui)
[05:26:44] <wikibugs>	 SRE, ops-eqiad, DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (Marostegui) Thank you!
[05:32:45] <wikibugs>	 (PS1) Marostegui: Revert "db1173: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/809700
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T0600).
[06:00:15] <marostegui>	 o/
[06:01:59] <marostegui>	 Anyone else around?
[06:03:13] <Amir1>	 marostegui: I am
[06:03:18] <marostegui>	 Amir1: o/
[06:03:20] <marostegui>	 Let's start then
[06:03:24] <marostegui>	 !log Starting x1 eqiad failover from db1103 to db1120 - T300472
[06:03:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:30] <stashbot>	 T300472: Switchover x1 master (db1103 -> db1120) - https://phabricator.wikimedia.org/T300472
[06:03:50] <marostegui>	 Reminder: there is no way to put MW on RO for x1
[06:03:57] <marostegui>	 So I will do it directly to the master on mysql
[06:05:10] <marostegui>	 all done
[06:05:26] <marostegui>	 Amir1: Can you try to generate a write on x1?
[06:05:48] <Amir1>	 Sure
[06:06:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1120 to x1 primary and set section read-write T300472', diff saved to https://phabricator.wikimedia.org/P30633 and previous config saved to /var/cache/conftool/dbconfig/20220630-060601-root.json
[06:06:06] <marostegui>	 ^ now
[06:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:31] <marostegui>	 I see connections on the new master
[06:06:48] <Amir1>	 marostegui: it works
[06:06:50] <Amir1>	 https://w.wiki/5NUn
[06:07:34] <marostegui>	 great!
[06:11:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103 T300472', diff saved to https://phabricator.wikimedia.org/P30634 and previous config saved to /var/cache/conftool/dbconfig/20220630-061140-root.json
[06:11:55] <marostegui>	 Everything looks fine
[06:12:02] <Amir1>	 Wohoo
[06:12:11] <Amir1>	 Thanks!
[06:14:36] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:16:26] <wikibugs>	 (PS1) Marostegui: db1103: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/809878 (https://phabricator.wikimedia.org/T300099)
[06:21:19] <wikibugs>	 (CR) Marostegui: [C: +2] db1103: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/809878 (https://phabricator.wikimedia.org/T300099) (owner: Marostegui)
[06:23:44] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki::php: fix proxy selection when using unix sockets [puppet] - https://gerrit.wikimedia.org/r/809879
[06:25:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1103.eqiad.wmnet with OS bullseye
[06:33:37] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1173: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/809700 (owner: Marostegui)
[06:33:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1103.eqiad.wmnet with reason: host reimage
[06:33:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 1%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30635 and previous config saved to /var/cache/conftool/dbconfig/20220630-063622-root.json
[06:36:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1103.eqiad.wmnet with reason: host reimage
[06:37:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:21] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36134/console" [puppet] - https://gerrit.wikimedia.org/r/809879 (owner: Giuseppe Lavagetto)
[06:39:03] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:01] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 2%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30636 and previous config saved to /var/cache/conftool/dbconfig/20220630-065126-root.json
[06:51:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:41] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:09] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1103.eqiad.wmnet with OS bullseye
[06:56:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 1%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30637 and previous config saved to /var/cache/conftool/dbconfig/20220630-065621-root.json
[06:56:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/809594 (owner: Slyngshede)
[06:56:25] <wikibugs>	 (PS1) Marostegui: Revert "db1103: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/809887
[06:56:27] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] profile::prometheus::ops enable Ganeti metric scraping. [puppet] - https://gerrit.wikimedia.org/r/809594 (owner: Slyngshede)
[06:57:20] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1103: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/809887 (owner: Marostegui)
[06:58:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P30638 and previous config saved to /var/cache/conftool/dbconfig/20220630-065857-root.json
[06:59:46] <wikibugs>	 (PS2) Giuseppe Lavagetto: mediawiki::php: fix proxy selection when using unix sockets [puppet] - https://gerrit.wikimedia.org/r/809879
[07:00:05] <jouncebot>	 Amir1 and apergos: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T0700).
[07:00:17] <apergos>	 hello everybody!
[07:00:24] <apergos>	 there are no trainees signed up for today's window
[07:00:38] <apergos>	 and that's a good thing because there are also no patches scheduled for deployment :-D
[07:01:00] <apergos>	 if anyone wants to step up and self deploy, now's the time, in about 15 minutes I'm going to wander off.
[07:11:22] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36135/console" [puppet] - https://gerrit.wikimedia.org/r/809879 (owner: Giuseppe Lavagetto)
[07:11:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 2%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30639 and previous config saved to /var/cache/conftool/dbconfig/20220630-071125-root.json
[07:11:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36136/console" [puppet] - https://gerrit.wikimedia.org/r/809879 (owner: Giuseppe Lavagetto)
[07:15:06] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1 C: +2] mediawiki::php: fix proxy selection when using unix sockets [puppet] - https://gerrit.wikimedia.org/r/809879 (owner: Giuseppe Lavagetto)
[07:15:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 weight', diff saved to https://phabricator.wikimedia.org/P30640 and previous config saved to /var/cache/conftool/dbconfig/20220630-071522-marostegui.json
[07:15:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P30641 and previous config saved to /var/cache/conftool/dbconfig/20220630-071526-root.json
[07:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:11] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/807983 (owner: Slyngshede)
[07:17:11] <apergos>	 welp 15 minutes later no one has stepped up so that's it for today
[07:18:55] <wikibugs>	 SRE, DNS, Infrastructure-Foundations, Mail, and 2 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (jcrespo)
[07:19:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "See also extended rationale in the task, I don't think this is necessary" [puppet] - https://gerrit.wikimedia.org/r/808040 (https://phabricator.wikimedia.org/T311262) (owner: Herron)
[07:21:44] <wikibugs>	 SRE, DNS, Infrastructure-Foundations, Mail, and 2 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (jcrespo) I created this when I saw someone mentioning it on discord. Ping @Vgutierrez @BBlack (I personally have no thought, I didn't know this was a th...
[07:21:52] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[07:24:09] <wikibugs>	 (CR) Filippo Giunchedi: "Idea LGTM, see inline" [puppet] - https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[07:25:51] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[07:26:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 5%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30642 and previous config saved to /var/cache/conftool/dbconfig/20220630-072629-root.json
[07:26:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:47] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[07:26:54] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff) p:Triage→Medium
[07:26:58] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff) p:Triage→Medium
[07:30:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 2%: After reimage', diff saved to https://phabricator.wikimedia.org/P30643 and previous config saved to /var/cache/conftool/dbconfig/20220630-073030-root.json
[07:30:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:48] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] class role::apt_repo switch apt-repo to Apache2, from nginx. [puppet] - 10https://gerrit.wikimedia.org/r/807983 (owner: 10Slyngshede)
[07:36:26] <icinga-wm>	 RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[07:36:39] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: am: add 'host' label and add port to 'instance' [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763460 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi)
[07:37:34] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:41:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30644 and previous config saved to /var/cache/conftool/dbconfig/20220630-074133-root.json
[07:41:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:32] <slyngs>	 !log Move apt repository to Apache2, from Nginx https://gerrit.wikimedia.org/r/c/operations/puppet/+/807983
[07:42:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P30645 and previous config saved to /var/cache/conftool/dbconfig/20220630-074534-root.json
[07:45:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:14] <wikibugs>	 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) >>! In T310980#8037616, @elukey wrote: > If everybody agrees I'd keep Buster for the moment, and possibly ML could be the first cluster to be upgraded when Cassandra 4 is import...
[07:52:47] <wikibugs>	 10SRE-swift-storage: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (10fgiunchedi)
[07:53:14] <wikibugs>	 (03PS1) 10Aklapper: Phabricator: Remove unneeded translation overrides [puppet] - 10https://gerrit.wikimedia.org/r/809907 (https://phabricator.wikimedia.org/T309746)
[07:55:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Marostegui)
[07:56:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30646 and previous config saved to /var/cache/conftool/dbconfig/20220630-075637-root.json
[07:56:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:00] <wikibugs>	 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) Definitely :)  The main worry that I have now is that moving to Bullseye for Cassandra nodes will mean upgrading to 4.x at this point, unless we find a way to move cqlsh.py to python 3 in...
[08:00:05] <jouncebot>	 dduvall and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T0800).
[08:00:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P30647 and previous config saved to /var/cache/conftool/dbconfig/20220630-080038-root.json
[08:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:32] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:17] <wikibugs>	 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) I checked in the jira that was pointed out earlier, and I noticed two things:  1) Most of the subtasks are related to finding how to test things with python3 etc.. 2) All the discussions...
[08:10:32] <icinga-wm>	 PROBLEM - HTTPS on apt1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/APT_repository
[08:11:23] <vgutierrez>	 slyngs: ^^
[08:11:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30648 and previous config saved to /var/cache/conftool/dbconfig/20220630-081140-root.json
[08:11:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:25] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache2001.codfw.wmnet with OS buster
[08:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P30649 and previous config saved to /var/cache/conftool/dbconfig/20220630-081542-root.json
[08:15:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:00] <logmsgbot>	 !log elukey@deploy1002 Started deploy [ores/deploy@dfaec93]: Update ores submodule to its latest commit and scap canary settings
[08:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 (10ayounsi) 05Open→03Resolved a:03ayounsi Awesome, thanks a lot @ssingh   I slightly cleaned up the doc (added a mention of the bird2 upgrade) And updated the dashboard at https://g...
[08:20:38] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: wmflib::service::get_url: avoid using monitoring to find the url. [puppet] - 10https://gerrit.wikimedia.org/r/800010
[08:22:35] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526 (10Marostegui) 05Open→03Resolved Data looks fine, resolving.
[08:24:42] <wikibugs>	 (03PS1) 10Slyngshede: P:aptrepo::wikimedia enable OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/809911
[08:24:56] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:26:06] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2001.codfw.wmnet with reason: host reimage
[08:26:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30650 and previous config saved to /var/cache/conftool/dbconfig/20220630-082644-root.json
[08:26:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:27] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache2002.codfw.wmnet with OS buster
[08:28:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:31] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2001.codfw.wmnet with reason: host reimage
[08:28:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:53] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36137/console" [puppet] - 10https://gerrit.wikimedia.org/r/809911 (owner: 10Slyngshede)
[08:29:53] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: switch ip_allow.config to YAML format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[08:30:01] <wikibugs>	 10SRE, 10DNS, 10Fundraising-Backlog, 10Infrastructure-Foundations, and 3 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (10greg) The email team in fundraising has interest in this topic as well.
[08:30:25] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[08:30:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P30651 and previous config saved to /var/cache/conftool/dbconfig/20220630-083046-root.json
[08:30:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:54] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:48] <logmsgbot>	 !log elukey@deploy1002 Finished deploy [ores/deploy@dfaec93]: Update ores submodule to its latest commit and scap canary settings (duration: 14m 48s)
[08:33:52] <wikibugs>	 (03PS1) 10Muehlenhoff: Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913)
[08:33:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff)
[08:35:32] <wikibugs>	 10SRE, 10DNS, 10Fundraising-Backlog, 10Infrastructure-Foundations, and 3 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (10jcrespo) Probably related: T211404 T167337
[08:38:10] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] wmflib::service::get_url: avoid using monitoring to find the url. [puppet] - 10https://gerrit.wikimedia.org/r/800010 (owner: 10Giuseppe Lavagetto)
[08:39:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] ores: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809625 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[08:40:10] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:52] <wikibugs>	 10SRE-swift-storage, 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (10elukey) @MatthewVernon hi! Do you have any guidance about how to proceed?
[08:41:15] <wikibugs>	 10SRE-swift-storage, 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (10elukey)
[08:41:28] <wikibugs>	 10SRE-swift-storage, 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (10elukey)
[08:41:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30652 and previous config saved to /var/cache/conftool/dbconfig/20220630-084148-root.json
[08:41:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:01] <wikibugs>	 (03PS2) 10Muehlenhoff: Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913)
[08:42:02] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2002.codfw.wmnet with reason: host reimage
[08:42:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:46] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache2003.codfw.wmnet with OS buster
[08:42:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:54] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2002.codfw.wmnet with reason: host reimage
[08:44:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P30653 and previous config saved to /var/cache/conftool/dbconfig/20220630-084550-root.json
[08:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from x1 master - not needed anymore', diff saved to https://phabricator.wikimedia.org/P30654 and previous config saved to /var/cache/conftool/dbconfig/20220630-084621-marostegui.json
[08:46:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:12] <wikibugs>	 (03PS3) 10Muehlenhoff: Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913)
[08:56:24] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2003.codfw.wmnet with reason: host reimage
[08:56:26] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-cache2003.codfw.wmnet with reason: host reimage
[08:56:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:10] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2001.codfw.wmnet with OS buster
[08:57:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:55] <wikibugs>	 (03PS54) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040)
[09:07:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable component/ganeti3 for codfw [puppet] - 10https://gerrit.wikimedia.org/r/809920 (https://phabricator.wikimedia.org/T311686)
[09:10:32] <wikibugs>	 (03PS1) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[09:16:08] <wikibugs>	 (03Abandoned) 10Slyngshede: P:aptrepo::wikimedia enable OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/809911 (owner: 10Slyngshede)
[09:17:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable component/ganeti3 for codfw [puppet] - 10https://gerrit.wikimedia.org/r/809920 (https://phabricator.wikimedia.org/T311686) (owner: 10Muehlenhoff)
[09:19:17] <icinga-wm>	 ACKNOWLEDGEMENT - HTTPS on apt1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response Slyngshede waiting for rollback https://wikitech.wikimedia.org/wiki/APT_repository
[09:26:21] <wikibugs>	 (03PS1) 10Slyngshede: P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923
[09:28:45] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36139/console" [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling)
[09:28:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede)
[09:32:21] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:40] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Blackify python files [software/benchmw] - 10https://gerrit.wikimedia.org/r/809924
[09:34:42] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add --cookie command line option [software/benchmw] - 10https://gerrit.wikimedia.org/r/809925
[09:34:51] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove db2083 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/809926 (https://phabricator.wikimedia.org/T311695)
[09:35:07] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36140/console" [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede)
[09:35:08] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on sretest1001.eqiad.wmnet with reason: Testing
[09:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:21] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on sretest1001.eqiad.wmnet with reason: Testing
[09:35:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:04] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2002.codfw.wmnet with OS buster
[09:36:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2083 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/809926 (https://phabricator.wikimedia.org/T311695) (owner: 10Marostegui)
[09:38:42] <wikibugs>	 (03PS2) 10Slyngshede: P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923
[09:41:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede)
[09:42:01] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:42:17] <wikibugs>	 (03PS3) 10Slyngshede: P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923
[09:42:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2083 from dbctl', diff saved to https://phabricator.wikimedia.org/P30655 and previous config saved to /var/cache/conftool/dbconfig/20220630-094239-marostegui.json
[09:42:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:49] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] Implement MediaWiki multi-DC traffic component (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling)
[09:44:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede)
[09:47:50] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2003.codfw.wmnet with OS buster
[09:47:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:17] <wikibugs>	 (03PS4) 10Slyngshede: P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923
[09:55:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede)
[09:56:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Blackify python files [software/benchmw] - 10https://gerrit.wikimedia.org/r/809924 (owner: 10Giuseppe Lavagetto)
[09:59:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Add --cookie command line option [software/benchmw] - 10https://gerrit.wikimedia.org/r/809925 (owner: 10Giuseppe Lavagetto)
[10:00:04] <jouncebot>	 mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T1000)
[10:03:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Blackify python files [software/benchmw] - 10https://gerrit.wikimedia.org/r/809924 (owner: 10Giuseppe Lavagetto)
[10:06:46] <wikibugs>	 (03PS1) 10Stang: Fixes Content sub unreadable in Vector 22 [skins/Vector] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809890 (https://phabricator.wikimedia.org/T311564)
[10:08:50] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the addition" [cookbooks] - 10https://gerrit.wikimedia.org/r/809599 (https://phabricator.wikimedia.org/T311593) (owner: 10Muehlenhoff)
[10:10:27] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add --cookie command line option [software/benchmw] - 10https://gerrit.wikimedia.org/r/809925 (owner: 10Giuseppe Lavagetto)
[10:12:45] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] api-gateway: allow discovery services to set custom rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan)
[10:14:36] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:17:16] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] multiversion: Move missing.php from wmf-config/ to /multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807610 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle)
[10:17:43] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] missing.php: Update docs and add test plan [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807609 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle)
[10:24:23] <wikibugs>	 (03PS5) 10Slyngshede: P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923
[10:28:19] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff)
[10:31:10] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[10:33:34] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:35:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede)
[10:37:44] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede)
[10:38:59] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "I don't see it used either, and I don't recall the context." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806908 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi)
[10:40:48] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:40:52] <wikibugs>	 (03PS11) 10Volans: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 (owner: 10Ayounsi)
[10:42:32] <wikibugs>	 (03PS1) 10MarcoAurelio: Amend license request contact form per Legal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359)
[10:44:56] <icinga-wm>	 PROBLEM - DPKG on apt2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:49:20] <icinga-wm>	 RECOVERY - HTTPS on apt1001 is OK: SSL OK - OCSP staple validity for apt.wikimedia.org has 281440 seconds left:Certificate apt.wikimedia.org valid until 2022-08-08 04:49:26 +0000 (expires in 38 days) https://wikitech.wikimedia.org/wiki/APT_repository
[10:51:42] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:55:16] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[10:56:52] <icinga-wm>	 PROBLEM - DPKG on apt1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:57:45] <wikibugs>	 (03CR) 10Vgutierrez: Revert "Cache Badtitle 400s for 60s in varnish-fe" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769827 (owner: 10Legoktm)
[10:59:06] <icinga-wm>	 ACKNOWLEDGEMENT - DPKG on apt1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Slyngshede Issue installing and reinstalling nginx https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:59:06] <icinga-wm>	 ACKNOWLEDGEMENT - DPKG on apt2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Slyngshede Issue installing and reinstalling nginx https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:09:40] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:16:22] <icinga-wm>	 RECOVERY - DPKG on apt2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:23:38] <wikibugs>	 (03PS1) 10Slyngshede: P:puppet:agent run puppet agent one minute after boot. [puppet] - 10https://gerrit.wikimedia.org/r/809943
[11:28:18] <icinga-wm>	 RECOVERY - DPKG on apt1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:33:40] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:35:14] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36141/console" [puppet] - 10https://gerrit.wikimedia.org/r/809943 (owner: 10Slyngshede)
[11:38:24] <wikibugs>	 (03CR) 10Slyngshede: P:puppet:agent run puppet agent one minute after boot. [puppet] - 10https://gerrit.wikimedia.org/r/809943 (owner: 10Slyngshede)
[11:43:18] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:43:24] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P30657 and previous config saved to /var/cache/conftool/dbconfig/20220630-114419-ladsgroup.json
[11:44:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:43] <wikibugs>	 10SRE-swift-storage: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (10fgiunchedi) A sample of said utilization (on thanos-be2002, other hosts are similar)  {F35289121}
[11:47:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Disable swap before running wipefs [cookbooks] - 10https://gerrit.wikimedia.org/r/809599 (https://phabricator.wikimedia.org/T311593) (owner: 10Muehlenhoff)
[11:48:16] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:51:30] <wikibugs>	 (03PS6) 10Filippo Giunchedi: icinga: remove 'monitoring' from service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946)
[11:54:41] <wikibugs>	 (03Abandoned) 10Kosta Harlan: Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809549 (https://phabricator.wikimedia.org/T311467) (owner: 10Kosta Harlan)
[11:57:53] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: heavier banhammer for tegola object-server 'access logs' [puppet] - 10https://gerrit.wikimedia.org/r/809966 (https://phabricator.wikimedia.org/T297959)
[11:59:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P30658 and previous config saved to /var/cache/conftool/dbconfig/20220630-115923-ladsgroup.json
[11:59:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:42] <wikibugs>	 (CR) Filippo Giunchedi: "I noticed thanos-be2001 filling up root FS with logs again, turns out (in retrospect obviously) that we were banning based on container na" [puppet] - https://gerrit.wikimedia.org/r/809966 (https://phabricator.wikimedia.org/T297959) (owner: Filippo Giunchedi)
[12:02:10] <wikibugs>	 SRE, SRE Observability, User-fgiunchedi: systemd state on thanos-fe1001 is flapping - https://phabricator.wikimedia.org/T311322 (fgiunchedi) I think I tracked the root cause down, an entry for `thanos-swift.discovery.wmnet` pointing to codfw was present in `/etc/hosts`. TBH I can't remember if I did...
[12:02:53] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "I believe this is good to be merged now!" [puppet] - https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[12:10:33] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/809943 (owner: Slyngshede)
[12:11:27] <wikibugs>	 (PS1) Ssingh: bird: add validate_cmd for bird.conf [puppet] - https://gerrit.wikimedia.org/r/809968
[12:12:06] <wikibugs>	 (CR) CI reject: [V: -1] bird: add validate_cmd for bird.conf [puppet] - https://gerrit.wikimedia.org/r/809968 (owner: Ssingh)
[12:12:08] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add Paul Norman to contributors [puppet] - https://gerrit.wikimedia.org/r/809629 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:12:56] <wikibugs>	 (PS2) Ssingh: bird: add validate_cmd for bird.conf [puppet] - https://gerrit.wikimedia.org/r/809968
[12:14:10] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36143/console" [puppet] - https://gerrit.wikimedia.org/r/809968 (owner: Ssingh)
[12:14:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P30659 and previous config saved to /var/cache/conftool/dbconfig/20220630-121427-ladsgroup.json
[12:14:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:41] <wikibugs>	 (CR) Ssingh: bird: add validate_cmd for bird.conf [puppet] - https://gerrit.wikimedia.org/r/809968 (owner: Ssingh)
[12:15:09] <wikibugs>	 SRE, DC-Ops, Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (Papaul)
[12:15:32] <wikibugs>	 (PS1) Slyngshede: P:aptrepo::wikimedia move private repo to nginx and uninstall apache [puppet] - https://gerrit.wikimedia.org/r/809969
[12:16:32] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Papaul)
[12:17:07] <wikibugs>	 (CR) Urbanecm: "code looks good, i just want to highlight an approval required by a comment (can't find a sign of approval on the patch or task)" [mediawiki-config] - https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: Sbisson)
[12:17:26] <wikibugs>	 (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[12:18:57] <wikibugs>	 (CR) CI reject: [V: -1] QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[12:20:59] <wikibugs>	 (PS2) Slyngshede: P:puppet:agent run puppet agent one minute after startup. [puppet] - https://gerrit.wikimedia.org/r/809943
[12:23:45] <wikibugs>	 (PS2) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[12:23:47] <wikibugs>	 (PS55) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040)
[12:25:39] <wikibugs>	 (CR) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (4 comments) [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[12:26:14] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36144/console" [puppet] - https://gerrit.wikimedia.org/r/809943 (owner: Slyngshede)
[12:26:20] <wikibugs>	 (CR) Ssingh: [V: +1] "(removed +1 by mistake that PCC added)" [puppet] - https://gerrit.wikimedia.org/r/809968 (owner: Ssingh)
[12:29:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P30660 and previous config saved to /var/cache/conftool/dbconfig/20220630-122931-ladsgroup.json
[12:29:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:42] <wikibugs>	 (CR) David Caro: [C: +2] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[12:30:04] <wikibugs>	 (CR) David Caro: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[12:31:18] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[12:31:43] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/809943 (owner: Slyngshede)
[12:31:59] <wikibugs>	 (PS56) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040)
[12:32:01] <wikibugs>	 (PS3) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[12:32:42] <wikibugs>	 (CR) David Caro: [C: +2] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[12:35:14] <wikibugs>	 (PS11) DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015)
[12:36:36] <wikibugs>	 (PS1) Marostegui: mariadb: Remove db2083 [puppet] - https://gerrit.wikimedia.org/r/809975 (https://phabricator.wikimedia.org/T311695)
[12:36:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2083.codfw.wmnet
[12:36:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:35] <wikibugs>	 (CR) DDesouza: "Fixed spaces being used for indentation." [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[12:39:09] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Remove db2083 [puppet] - https://gerrit.wikimedia.org/r/809975 (https://phabricator.wikimedia.org/T311695) (owner: Marostegui)
[12:39:45] <wikibugs>	 (PS1) Papaul: ADD new PDU model to ps1-a4-codfw [puppet] - https://gerrit.wikimedia.org/r/809977 (https://phabricator.wikimedia.org/T309957)
[12:40:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[12:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:27] <wikibugs>	 (CR) KartikMistry: Enable Wikistories on idwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: Sbisson)
[12:42:13] <kart_>	 James_F: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/809165 - can you take a look when you're available.
[12:44:36] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:44:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2083.codfw.wmnet
[12:46:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:28] <wikibugs>	 ops-codfw, decommission-hardware, Patch-For-Review: decommission db2083 - https://phabricator.wikimedia.org/T311695 (Marostegui) a:Papaul
[12:47:39] <wikibugs>	 ops-codfw, decommission-hardware, Patch-For-Review: decommission db2083 - https://phabricator.wikimedia.org/T311695 (Marostegui) @Papaul this is ready!
[12:47:47] <wikibugs>	 ops-codfw, decommission-hardware, Patch-For-Review: decommission db2083 - https://phabricator.wikimedia.org/T311695 (Marostegui)
[12:49:15] <wikibugs>	 ops-codfw, decommission-hardware, Patch-For-Review: decommission db2083 - https://phabricator.wikimedia.org/T311695 (Papaul)
[12:53:05] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] P:puppet:agent run puppet agent one minute after startup. [puppet] - https://gerrit.wikimedia.org/r/809943 (owner: Slyngshede)
[12:54:22] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:55:35] <wikibugs>	 (PS1) Volans: sre.hosts.reimage: fix --no-pxe puppet behaviour [cookbooks] - https://gerrit.wikimedia.org/r/809983
[12:58:39] <wikibugs>	 (CR) CI reject: [V: -1] sre.hosts.reimage: fix --no-pxe puppet behaviour [cookbooks] - https://gerrit.wikimedia.org/r/809983 (owner: Volans)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T1300).
[13:00:05] <jouncebot>	 koi and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:29] <koi>	 hi
[13:00:33] <moritzm>	 !log uploaded php-defaults 76+wmf1+buster2 for component/php74 (drops a Breaks: on php72-common)  T311386
[13:00:39] <urbanecm>	 hi! i can deploy today
[13:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:41] <stashbot>	 T311386: Install php 7.4 in production - https://phabricator.wikimedia.org/T311386
[13:01:49] <wikibugs>	 (CR) Urbanecm: [C: +2] Fixes Content sub unreadable in Vector 22 [skins/Vector] (wmf/1.39.0-wmf.18) - https://gerrit.wikimedia.org/r/809890 (https://phabricator.wikimedia.org/T311564) (owner: Stang)
[13:01:59] <urbanecm>	 koi: I'll let you know once this is ready to be tested
[13:02:01] <kostajh>	 hi, i'm here
[13:02:04] <urbanecm>	 hi kostajh 
[13:02:10] <koi>	 got it, thanks
[13:02:26] <wikibugs>	 (PS5) Urbanecm: Structured task: enable free text for "other" rejection reason [mediawiki-config] - https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse)
[13:02:31] <wikibugs>	 (CR) Urbanecm: [C: +2] Structured task: enable free text for "other" rejection reason [mediawiki-config] - https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse)
[13:04:23] <wikibugs>	 (Merged) jenkins-bot: Structured task: enable free text for "other" rejection reason [mediawiki-config] - https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse)
[13:04:45] <urbanecm>	 kostajh: pulled to mwdebug1001, can you have a look?
[13:04:50] <kostajh>	 urbanecm: yep, one moment
[13:04:50] <wikibugs>	 (PS2) Volans: sre.hosts.reimage: fix --no-pxe puppet behaviour [cookbooks] - https://gerrit.wikimedia.org/r/809983
[13:05:48] <moritzm>	 !log upgrade mwdebug* servers to 2:76+wmf1~buster2 T311386
[13:05:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:54] <stashbot>	 T311386: Install php 7.4 in production - https://phabricator.wikimedia.org/T311386
[13:07:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:07:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:49] <kostajh>	 urbanecm: lgtm
[13:08:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:08:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:08:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:57] <urbanecm>	 kostajh: thanks, syncing
[13:09:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:09:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:13] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host stat1009.mgmt.eqiad.wmnet with reboot policy FORCED
[13:09:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:27] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM,PCI ID looks correct: https://pci-ids.ucw.cz/read/PC/1000/10e2" [puppet] - https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: Muehlenhoff)
[13:11:03] <wikibugs>	 (CR) Volans: [C: +2] sre.hosts.reimage: fix --no-pxe puppet behaviour [cookbooks] - https://gerrit.wikimedia.org/r/809983 (owner: Volans)
[13:12:12] <wikibugs>	 (PS1) David Caro: wmcs.puppet_alert: properly check if files exist [puppet] - https://gerrit.wikimedia.org/r/809987
[13:12:28] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Allow mwbuilder group to access mwdeploy key [puppet] - https://gerrit.wikimedia.org/r/809712 (https://phabricator.wikimedia.org/T310395) (owner: Ahmon Dancy)
[13:13:07] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: fb399065b123db536ae244a0c0fada61eb906a6e: Structured task: enable free text for "other" rejection reason (T304099) (duration: 03m 46s)
[13:13:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:13] <stashbot>	 T304099: Structured tasks: temporary free text for "other" rejection reason - https://phabricator.wikimedia.org/T304099
[13:13:51] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.reimage: fix --no-pxe puppet behaviour [cookbooks] - https://gerrit.wikimedia.org/r/809983 (owner: Volans)
[13:14:10] <urbanecm>	 kostajh: and, should be live. anything else?
[13:14:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:14:13] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (BTullis) Hi @Cmjohnson - that's really interesting. I think that you're one step closer to a working system than I am, but ultimately I think...
[13:14:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:24] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36145/console" [puppet] - https://gerrit.wikimedia.org/r/809969 (owner: Slyngshede)
[13:14:30] <kostajh>	 urbanecm: that's all, thank you
[13:14:35] <urbanecm>	 no problem :)
[13:14:49] <urbanecm>	 kostajh: fwiw i don't see the other box at cswiki, but i guess that's because it's wmf.17?
[13:15:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:15:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:15:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:16:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:39] <wikibugs>	 (CR) Raymond Ndibe: [C: +1] wmcs.puppet_alert: properly check if files exist [puppet] - https://gerrit.wikimedia.org/r/809987 (owner: David Caro)
[13:16:58] <moritzm>	 !log installing libsndfile security updates
[13:17:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:46] <wikibugs>	 (Merged) jenkins-bot: Fixes Content sub unreadable in Vector 22 [skins/Vector] (wmf/1.39.0-wmf.18) - https://gerrit.wikimedia.org/r/809890 (https://phabricator.wikimedia.org/T311564) (owner: Stang)
[13:18:54] <wikibugs>	 (CR) Slavina Stefanova: [C: +1] wmcs.puppet_alert: properly check if files exist [puppet] - https://gerrit.wikimedia.org/r/809987 (owner: David Caro)
[13:19:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2155.codfw.wmnet with OS bullseye
[13:19:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:15] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye
[13:19:52] <wikibugs>	 (CR) David Caro: [C: +2] "Thanks both!" [puppet] - https://gerrit.wikimedia.org/r/809987 (owner: David Caro)
[13:19:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2155.codfw.wmnet with reason: host reimage
[13:20:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:01] <urbanecm>	 koi: your patch is at mwdebug1001, can you check?
[13:21:06] <koi>	 looking
[13:21:54] <koi>	 urbanecm: LGTM
[13:21:58] <urbanecm>	 thanks, syncing
[13:22:41] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:22:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:50] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 09s)
[13:22:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:17] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:23:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:25] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[13:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2155.codfw.wmnet with reason: host reimage
[13:23:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:24:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:25:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:25:51] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.18/skins/Vector/resources/skins.vector.styles/layouts/screen.less: a927e6fbf56f031c42737cd9710eb0531bab43e1: Fixes Content sub unreadable in Vector 22 (T311564) (duration: 03m 18s)
[13:25:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:59] <stashbot>	 T311564: Content sub unreadable in Vector 22 - https://phabricator.wikimedia.org/T311564
[13:26:01] <urbanecm>	 koi: and it's live. anything else?
[13:26:04] <icinga-wm>	 PROBLEM - PHP7 rendering on mwdebug2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:26:08] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:26:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:16] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[13:26:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:54] <kostajh>	 urbanecm: correct, the config patch requires wmf.18 to take effect
[13:27:07] <koi>	 urbanecm: one question, is the backport window a proper place to schedule a maintenance script run?
[13:27:45] <urbanecm>	 koi: on that, you need to sync with someone who can run them. if it's a simple one, i can run it now :)
[13:28:03] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host stat1009.eqiad.wmnet with OS buster
[13:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS buster
[13:29:30] <koi>	 urbanecm: cool! would you like to have a look at T311012?
[13:29:31] <stashbot>	 T311012: Attach account "New user message" to its global account - https://phabricator.wikimedia.org/T311012
[13:29:41] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[13:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host poolcounter2004.codfw.wmnet
[13:31:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:31] <urbanecm>	 !log [urbanecm@mwmaint1002 /srv/mediawiki/php]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=rowiki --userlist /home/urbanecm/users.txt # T311012, users.txt has `New user message` only
[13:32:34] <urbanecm>	 koi: here you go :)
[13:32:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:26] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:29] <koi>	 thanks a lot :)
[13:33:57] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host stat1009.eqiad.wmnet with OS buster
[13:34:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS buster executed with errors: - stat100...
[13:34:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2155.codfw.wmnet with OS bullseye
[13:34:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:09] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye completed: - db2...
[13:34:21] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host stat1009.eqiad.wmnet with OS bullseye
[13:34:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye
[13:34:53] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
[13:34:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2004.codfw.wmnet
[13:35:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:19] <urbanecm>	 !log run `CentralAuthUser::importLocalNames` for `MediaWiki message delivery` (T275935)
[13:35:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:25] <stashbot>	 T275935: Please manually attach new MassMessage system accounts on Wikimedia wikis - https://phabricator.wikimedia.org/T275935
[13:35:57] <urbanecm>	 !log [urbanecm@mwmaint1002 /srv/mediawiki/php]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=rowiki --userlist /home/urbanecm/users.txt # T275935, users.txt has `MediaWiki message delivery` only
[13:36:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host poolcounter2003.codfw.wmnet
[13:36:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[13:36:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[13:36:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T309311)', diff saved to https://phabricator.wikimedia.org/P30661 and previous config saved to /var/cache/conftool/dbconfig/20220630-133619-ladsgroup.json
[13:36:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:26] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[13:37:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[13:37:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[13:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P30662 and previous config saved to /var/cache/conftool/dbconfig/20220630-133743-ladsgroup.json
[13:37:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:49] <stashbot>	 T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525
[13:38:25] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:38:28] <logmsgbot>	 !log bmansurov@deploy1002 deploy aborted: (no justification provided) (duration: 00m 03s)
[13:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:41] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[13:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:50] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[13:38:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:02] <urbanecm>	 !log run `CentralAuthUser::importLocalNames` for `New user message` (T311012)
[13:39:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:08] <stashbot>	 T311012: Attach account "New user message" to its global account - https://phabricator.wikimedia.org/T311012
[13:39:38] <urbanecm>	 !log [urbanecm@mwmaint1002 /srv/mediawiki/php]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=metawiki --userlist /home/urbanecm/users.txt # T311012, users.txt has `New user message` only
[13:39:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2003.codfw.wmnet
[13:39:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:05] <Amir1>	 !log killed refreshLinkRecommendations.php on arzwiki (T299021)
[13:40:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:12] <stashbot>	 T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021
[13:40:20] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:18] <wikibugs>	 (CR) Jgiannelos: [C: +1] swift: heavier banhammer for tegola object-server 'access logs' [puppet] - https://gerrit.wikimedia.org/r/809966 (https://phabricator.wikimedia.org/T297959) (owner: Filippo Giunchedi)
[13:42:29] <wikibugs>	 (PS1) Muehlenhoff: Depool poolcounter1005 for reboot [mediawiki-config] - https://gerrit.wikimedia.org/r/809990
[13:43:20] <icinga-wm>	 RECOVERY - PHP7 rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 302 Found - 564 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:47:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2154.codfw.wmnet with OS bullseye
[13:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:54] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS bullseye
[13:48:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2154.codfw.wmnet with reason: host reimage
[13:48:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:36] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host stat1009.eqiad.wmnet with OS bullseye
[13:48:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye executed with errors: - stat1...
[13:48:51] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (Cmjohnson)
[13:49:12] <wikibugs>	 (PS3) Giuseppe Lavagetto: mediawiki: install php7.4 on the canaries [puppet] - https://gerrit.wikimedia.org/r/808909 (https://phabricator.wikimedia.org/T311386)
[13:49:40] <icinga-wm>	 PROBLEM - Check systemd state on poolcounter2003 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:52:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2154.codfw.wmnet with reason: host reimage
[13:52:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:03] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[13:53:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (Cmjohnson) @BTullis @RobH @Papaul  I set the raid up so the raid 1 ssds were first and used the install script for buster.   Buster fails to see the disks, so I...
[13:54:49] <wikibugs>	 (CR) Ladsgroup: [C: +1] build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604 (owner: Krinkle)
[13:54:53] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (Cmjohnson) @BTullis just read your response on an-presto and see that you're experiencing this with stat1010. Thank you for digging into it more.
[13:55:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P30663 and previous config saved to /var/cache/conftool/dbconfig/20220630-135509-ladsgroup.json
[13:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:16] <stashbot>	 T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525
[13:55:28] <wikibugs>	 (CR) Ladsgroup: [C: +1] build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605 (owner: Krinkle)
[13:55:36] <moritzm>	 !log installing firejail security updates on stretch
[13:55:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:05] <wikibugs>	 (CR) Ladsgroup: [C: +1] noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[13:56:09] <wikibugs>	 (CR) Btullis: Assign new password to Cassandra superuser (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/809639 (https://phabricator.wikimedia.org/T311652) (owner: Eevans)
[13:56:37] <wikibugs>	 (CR) Majavah: "The API takes YAML not JSON, so this should probably use the generic `data` parameter instead." [puppet] - https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) (owner: Andrew Bogott)
[14:00:54] <wikibugs>	 SRE, Infrastructure-Foundations, CAS-SSO: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236 (MoritzMuehlenhoff) p:Triage→Medium
[14:01:06] <wikibugs>	 SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (MoritzMuehlenhoff) p:Triage→Medium
[14:02:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2154.codfw.wmnet with OS bullseye
[14:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:27] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS bullseye completed: - db2...
[14:02:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (BTullis) Thanks @Cmjohnson - yes I think that this is very likely to be the same issue. That's useful that you've experienced exactly the same outcome on this as I...
[14:05:58] <wikibugs>	 (PS3) Eevans: Assign new password to Cassandra superuser [labs/private] - https://gerrit.wikimedia.org/r/809639 (https://phabricator.wikimedia.org/T311652)
[14:09:30] <wikibugs>	 (PS1) Btullis: Add a hiera alias for the cassandra superuser password to AQS [puppet] - https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652)
[14:10:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P30664 and previous config saved to /var/cache/conftool/dbconfig/20220630-141014-ladsgroup.json
[14:10:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:57] <wikibugs>	 (CR) Btullis: "I think that to get a pcc run working we will need to merge this: https://gerrit.wikimedia.org/r/c/labs/private/+/809639" [puppet] - https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) (owner: Btullis)
[14:13:11] <wikibugs>	 (CR) Tchanders: [C: -1] "-1 just for the config name, now we've merged the other patch" [deployment-charts] - https://gerrit.wikimedia.org/r/808923 (https://phabricator.wikimedia.org/T310646) (owner: Hnowlan)
[14:13:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] swift: heavier banhammer for tegola object-server 'access logs' [puppet] - https://gerrit.wikimedia.org/r/809966 (https://phabricator.wikimedia.org/T297959) (owner: Filippo Giunchedi)
[14:14:03] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] icinga: remove 'monitoring' from service::catalog [puppet] - https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[14:14:36] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:14:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2160.codfw.wmnet with OS bullseye
[14:14:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:59] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2160.codfw.wmnet with OS bullseye
[14:15:46] <wikibugs>	 (CR) David Caro: [C: +2] P:openstack::puppetmaster: alert for puppet certs for deleted instances [puppet] - https://gerrit.wikimedia.org/r/806433 (owner: Majavah)
[14:17:20] <wikibugs>	 (CR) David Caro: [C: +2] P:(toolforge|wmcs::paws)::prometheus: improve namespace filtering [puppet] - https://gerrit.wikimedia.org/r/807562 (owner: Majavah)
[14:19:27] <wikibugs>	 SRE, SRE Observability, User-fgiunchedi: systemd state on thanos-fe1001 is flapping - https://phabricator.wikimedia.org/T311322 (fgiunchedi) Open→Resolved a:fgiunchedi Optimistically resolving because I haven't seen any more failures!
[14:20:00] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36148/console" [puppet] - https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[14:20:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
[14:20:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:51] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] "PCC's happy on alert1001 + lvs1019, merging" [puppet] - https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[14:22:41] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
[14:22:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
[14:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[14:25:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P30665 and previous config saved to /var/cache/conftool/dbconfig/20220630-142519-ladsgroup.json
[14:25:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:15] <wikibugs>	 (PS5) Filippo Giunchedi: keyholder: Collect Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/787911 (owner: Majavah)
[14:29:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:29:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:56] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] icinga: remove 'monitoring' from service::catalog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[14:30:00] <wikibugs>	 (CR) CI reject: [V: -1] keyholder: Collect Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/787911 (owner: Majavah)
[14:30:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (Ottomata) We will have to rebuild hadoop for bullseye, eh?  {T310643}
[14:32:07] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:32:14] <wikibugs>	 SRE, ops-codfw, decommission-hardware: decommission db2083 - https://phabricator.wikimedia.org/T311695 (Papaul)
[14:32:23] <wikibugs>	 (PS6) Filippo Giunchedi: keyholder: Collect Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/787911 (owner: Majavah)
[14:33:30] <wikibugs>	 SRE, ops-codfw, decommission-hardware: decommission db2083 - https://phabricator.wikimedia.org/T311695 (Papaul) Open→Resolved complete
[14:34:31] <wikibugs>	 (CR) Filippo Giunchedi: "I think this is now superseded by the blackbox http check" [puppet] - https://gerrit.wikimedia.org/r/786365 (owner: Jbond)
[14:34:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2160.codfw.wmnet with reason: host reimage
[14:34:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T309311)', diff saved to https://phabricator.wikimedia.org/P30666 and previous config saved to /var/cache/conftool/dbconfig/20220630-143436-ladsgroup.json
[14:34:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:46] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[14:34:46] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "I think this is now superseded by the blackbox http check" [puppet] - https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[14:35:12] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] keyholder: Collect Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/787911 (owner: Majavah)
[14:37:10] <wikibugs>	 (PS3) Cwhite: loki: add ferm service to control api access [puppet] - https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826)
[14:37:37] <wikibugs>	 (CR) Cwhite: loki: add ferm service to control api access (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[14:38:07] <_joe_>	 !log updating python-poolcounter to 0.0.2 across the fleet
[14:38:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2160.codfw.wmnet with reason: host reimage
[14:38:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P30667 and previous config saved to /var/cache/conftool/dbconfig/20220630-144024-ladsgroup.json
[14:40:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:31] <stashbot>	 T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525
[14:41:11] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[14:41:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:19] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[14:41:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:56] <wikibugs>	 (CR) Filippo Giunchedi: "See inline, LGTM otherwise" [puppet] - https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[14:41:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[14:42:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[14:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P30668 and previous config saved to /var/cache/conftool/dbconfig/20220630-144204-ladsgroup.json
[14:42:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:35] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[14:42:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:43] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s)
[14:42:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED
[14:46:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:44] <wikibugs>	 (PS5) Cwhite: logstash: duplicate alert logs for loki target [puppet] - https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826)
[14:46:50] <wikibugs>	 (CR) Cwhite: logstash: duplicate alert logs for loki target (2 comments) [puppet] - https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[14:48:16] <wikibugs>	 (PS4) Cwhite: loki: add ferm service to control api access [puppet] - https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826)
[14:48:47] <wikibugs>	 (CR) Cwhite: loki: add ferm service to control api access (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[14:49:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P30669 and previous config saved to /var/cache/conftool/dbconfig/20220630-144940-ladsgroup.json
[14:49:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:42] <wikibugs>	 (PS1) Majavah: add keyholder alerting [alerts] - https://gerrit.wikimedia.org/r/810003
[14:52:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host thumbor1001.eqiad.wmnet
[14:52:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2160.codfw.wmnet with OS bullseye
[14:52:25] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:52:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:30] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2160.codfw.wmnet with OS bullseye completed: - db2...
[14:54:20] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2156.codfw.wmnet with OS bullseye
[14:54:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:26] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2156.codfw.wmnet with OS bullseye
[14:54:31] <wikibugs>	 (CR) Herron: [C: +1] loki: add ferm service to control api access [puppet] - https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[14:56:59] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[14:57:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:09] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 10s)
[14:57:10] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Papaul)
[14:57:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:36] <wikibugs>	 SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (Eevans) >>! In T310980#8039322, @elukey wrote: > I checked in the jira that was pointed out earlier, and I noticed two things: >  > 1) Most of the subtasks are related to finding how to test thin...
[14:58:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P30670 and previous config saved to /var/cache/conftool/dbconfig/20220630-145820-ladsgroup.json
[14:58:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:27] <stashbot>	 T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525
[14:58:34] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] loki: add ferm service to control api access [puppet] - https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[14:59:06] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: duplicate alert logs for loki target [puppet] - https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[14:59:45] <wikibugs>	 (PS1) Elukey: Add a new Eventgate stream for revision-score events [mediawiki-config] - https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878)
[14:59:54] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Papaul) @Marostegui you can start putting the first 8 in service if you want. leave db2156 for now I am still doing install on it . I had some iss...
[15:00:34] <wikibugs>	 (CR) CI reject: [V: -1] Add a new Eventgate stream for revision-score events [mediawiki-config] - https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: Elukey)
[15:00:39] <wikibugs>	 (PS2) Elukey: Add a new Eventgate stream for revision-score events [mediawiki-config] - https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878)
[15:02:39] <wikibugs>	 (PS21) Volans: sre.network.configure-switch-interfaces: new [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[15:02:56] <stephanebisson>	 o/ eamedina47
[15:03:06] <wikibugs>	 (CR) Volans: [C: +1] "I've done a full pass and did also some minor adjustment. It looks good to me to start testing it live." [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[15:03:43] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Marostegui) @Papaul sounds good, so it'd be: 53, 54, 55, 57, 58, 59, 60 for now, right?
[15:03:49] <wikibugs>	 (CR) Volans: [C: +1] sre.network.configure-switch-interfaces: new [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[15:03:57] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thumbor1001.eqiad.wmnet
[15:04:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P30671 and previous config saved to /var/cache/conftool/dbconfig/20220630-150445-ladsgroup.json
[15:04:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:59] <papaul>	 !log ongoing PDU maintenance in Rack A4 CODFW
[15:05:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:40] <wikibugs>	 (PS22) Volans: sre.network.configure-switch-interfaces: new [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[15:07:59] <wikibugs>	 (CR) Volans: [C: +1] sre.network.configure-switch-interfaces: new (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[15:09:25] <wikibugs>	 (CR) Cwhite: [C: +2] loki: add ferm service to control api access [puppet] - https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[15:09:42] <wikibugs>	 (CR) Klausman: [C: +1] Add a new Eventgate stream for revision-score events [mediawiki-config] - https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: Elukey)
[15:10:13] <wikibugs>	 (PS1) Muehlenhoff: Add Alex Monk to contributors [puppet] - https://gerrit.wikimedia.org/r/810011 (https://phabricator.wikimedia.org/T308013)
[15:11:41] <wikibugs>	 SRE, Data-Engineering, Event-Platform, serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (JArguello-WMF)
[15:11:45] <icinga-wm>	 PROBLEM - Host ps1-a4-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:11:53] <wikibugs>	 SRE, Data-Engineering-Kanban, Event-Platform, serviceops, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (JArguello-WMF)
[15:12:15] <wikibugs>	 (PS2) Cwhite: logstash: add loki output support [puppet] - https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826)
[15:13:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P30672 and previous config saved to /var/cache/conftool/dbconfig/20220630-151325-ladsgroup.json
[15:13:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2156.codfw.wmnet with reason: host reimage
[15:14:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (JArguello-WMF)
[15:16:09] <wikibugs>	 SRE, Data-Engineering-Kanban, Traffic, Data Engineering Planning: Spike: Investigate creating robust alerts to notify that caching nodes are not sending traffic data - https://phabricator.wikimedia.org/T304651 (JArguello-WMF)
[15:16:55] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add Alex Monk to contributors [puppet] - https://gerrit.wikimedia.org/r/810011 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:17:44] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2156.codfw.wmnet with reason: host reimage
[15:17:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T309311)', diff saved to https://phabricator.wikimedia.org/P30673 and previous config saved to /var/cache/conftool/dbconfig/20220630-151951-ladsgroup.json
[15:19:52] <wikibugs>	 (PS1) Muehlenhoff: Drop references to puppet source files [puppet] - https://gerrit.wikimedia.org/r/810014
[15:19:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[15:19:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:57] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[15:20:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[15:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:10] <wikibugs>	 (PS3) Cwhite: logstash: add loki output support [puppet] - https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826)
[15:22:14] <wikibugs>	 (PS2) Muehlenhoff: Drop references to puppet source files [puppet] - https://gerrit.wikimedia.org/r/810014
[15:23:26] <icinga-wm>	 PROBLEM - Host cp2027 is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:38] <wikibugs>	 SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (elukey) >>! In T310980#8040624, @Eevans wrote: > I would propose that the way to think about this might be to ask ourselves how much runway we want/need from here to 4.x.  3.11.x is [[ https://ca...
[15:23:44] <icinga-wm>	 PROBLEM - Host kubemaster2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:44] <icinga-wm>	 PROBLEM - Host ms-be2060 is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:44] <icinga-wm>	 PROBLEM - Host ms-be2062 is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:44] <icinga-wm>	 PROBLEM - Host mw2251 is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:44] <icinga-wm>	 PROBLEM - Host mw2252 is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:44] <icinga-wm>	 PROBLEM - Host mw2253 is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:56] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:23:58] <icinga-wm>	 PROBLEM - Host ores2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:02] <icinga-wm>	 PROBLEM - Host ms-be2066 is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:04] <elukey>	 mmmmm
[15:24:05] <cwhite>	 uh...
[15:24:08] <sukhe>	 hmm
[15:24:09] <elukey>	 looks like a rack failure
[15:24:12] <icinga-wm>	 PROBLEM - Host people2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:13] <elukey>	 let's check
[15:24:16] <marostegui>	 papaul: ^
[15:24:24] <icinga-wm>	 PROBLEM - Host cp2028 is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:24] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:24:24] <icinga-wm>	 PROBLEM - Host ganeti2027 is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:30] <icinga-wm>	 PROBLEM - Host logstash2033 is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:33] <marostegui>	 In case there's something going on there
[15:24:50] <icinga-wm>	 PROBLEM - Host kafka-main2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:53] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stat1010.eqiad.wmnet with reason: host reimage
[15:24:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:00] <icinga-wm>	 PROBLEM - Host backup2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:05] <cwhite>	 rack A4
[15:25:09] <elukey>	 the rack should be A4 https://netbox.wikimedia.org/dcim/racks/46/
[15:25:12] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:25:12] <icinga-wm>	 PROBLEM - Host backup2004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:19] <elukey>	 yeah, maybe the PDU work marostegui ?
[15:25:20] <marostegui>	 jynus: ^
[15:25:22] <icinga-wm>	 PROBLEM - Host mc-gp2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:25] <zabe>	 <papaul> ongoing PDU maintenance in Rack A4 CODFW
[15:25:30] <jynus>	 :-(
[15:25:30] <icinga-wm>	 PROBLEM - Host dbprov2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:37] <jynus>	 not a big deal, it was idle
[15:25:38] <elukey>	 ahhhh thanks zabe 
[15:25:38] <icinga-wm>	 PROBLEM - Host ncredir2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:42] <icinga-wm>	 PROBLEM - Host kafkamon2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:44] <icinga-wm>	 PROBLEM - Host logstash2026 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:44] <icinga-wm>	 PROBLEM - Host orespoolcounter2003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:25:56] <jynus>	 is it networking?
[15:27:00] <elukey>	 checking ores2001
[15:27:10] <icinga-wm>	 RECOVERY - Host cp2027 is UP: PING WARNING - Packet loss = 90%, RTA = 31.57 ms
[15:27:10] <icinga-wm>	 RECOVERY - Host ganeti2027 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms
[15:27:12] <icinga-wm>	 RECOVERY - Host people2002 is UP: PING OK - Packet loss = 0%, RTA = 31.78 ms
[15:27:12] <icinga-wm>	 RECOVERY - Host logstash2026 is UP: PING OK - Packet loss = 0%, RTA = 33.00 ms
[15:27:12] <icinga-wm>	 RECOVERY - Host cp2028 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms
[15:27:12] <icinga-wm>	 RECOVERY - Host ncredir2001 is UP: PING OK - Packet loss = 0%, RTA = 31.84 ms
[15:27:12] <icinga-wm>	 RECOVERY - Host backup2002 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms
[15:27:13] <icinga-wm>	 RECOVERY - Host logstash2033 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms
[15:27:14] <icinga-wm>	 RECOVERY - Host kafkamon2002 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms
[15:27:14] <icinga-wm>	 RECOVERY - Host kubemaster2001 is UP: PING OK - Packet loss = 0%, RTA = 38.62 ms
[15:27:14] <icinga-wm>	 RECOVERY - Host mc-gp2001 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms
[15:27:16] <icinga-wm>	 RECOVERY - Host mw2252 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms
[15:27:16] <icinga-wm>	 RECOVERY - Host ores2001 is UP: PING OK - Packet loss = 0%, RTA = 31.76 ms
[15:27:16] <icinga-wm>	 RECOVERY - Host kafka-main2001 is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms
[15:27:18] <icinga-wm>	 RECOVERY - Host orespoolcounter2003 is UP: PING OK - Packet loss = 0%, RTA = 32.05 ms
[15:27:25] <dancy>	 Like a phoenix
[15:27:29] <jynus>	 elukey: let us know based on uptime
[15:27:31] <wikibugs>	 (03PS2) 10Muehlenhoff: ores: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809625 (https://phabricator.wikimedia.org/T308013)
[15:27:37] <jynus>	 if there is user impact I will start an incident
[15:27:49] <elukey>	 jynus: yeah seems so, I can access the OS but the network is not available
[15:27:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10JArguello-WMF)
[15:28:01] <jynus>	 so network only
[15:28:02] <icinga-wm>	 RECOVERY - Host ms-be2062 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms
[15:28:04] <icinga-wm>	 RECOVERY - Host mw2253 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms
[15:28:08] <icinga-wm>	 RECOVERY - Host mw2251 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms
[15:28:09] <elukey>	 now it works :D
[15:28:10] <icinga-wm>	 RECOVERY - Host dbprov2001 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms
[15:28:24] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stat1010.eqiad.wmnet with reason: host reimage
[15:28:25] <jynus>	 elukey: that confirms uptime > 5 minutes, right?
[15:28:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:30] <icinga-wm>	 RECOVERY - Host ms-be2060 is UP: PING OK - Packet loss = 0%, RTA = 32.08 ms
[15:28:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P30674 and previous config saved to /var/cache/conftool/dbconfig/20220630-152830-ladsgroup.json
[15:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:40] <elukey>	 jynus: nope 58 days (ores2001)
[15:28:42] <brett>	 s
[15:28:44] <icinga-wm>	 RECOVERY - Host backup2004 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms
[15:28:44] <jynus>	 good
[15:28:52] <jynus>	 can anyone see user impact?
[15:29:03] <jynus>	 I will be looking at graphs and logs
[15:29:14] <icinga-wm>	 RECOVERY - Host ms-be2066 is UP: PING OK - Packet loss = 0%, RTA = 32.72 ms
[15:29:17] <jynus>	 app servers and dbs shouldn't be impacted, but other services are active
[15:29:19] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10akosiaris)
[15:29:38] <elukey>	 in theory we should be ok
[15:29:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Drop references to puppet source files [puppet] - 10https://gerrit.wikimedia.org/r/810014 (owner: 10Muehlenhoff)
[15:29:47] <jynus>	 I saw a spike of 5XX
[15:30:10] <jynus>	 although it is not very large
[15:30:31] <jynus>	 https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&var-site=codfw&var-cache_type=varnish-text&var-cache_type=varnish-upload&var-status_type=5&var-method=GET&from=1656602583597&to=1656602995047&viewPanel=1
[15:30:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] ores: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809625 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:30:40] <icinga-wm>	 PROBLEM - Host logstash2026 is DOWN: PING CRITICAL - Packet loss = 100%
[15:31:04] <icinga-wm>	 PROBLEM - Host logstash2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:31:26] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2156.codfw.wmnet with OS bullseye
[15:31:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:31] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2156.codfw.wmnet with OS bullseye completed: - db2...
[15:31:36] <jynus>	 the services that should be active should have enough redundancy
[15:31:38] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp2028 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:31:41] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10akosiaris) As pointed out in T311732 (now merged as duplicate of...
[15:31:51] <jynus>	 except maybe ganeti (people?)
[15:32:25] <elukey>	 I think that all VMs down have redundancy
[15:32:38] <elukey>	 at least judging from a quick glance
[15:32:57] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: drop stretch support [puppet] - 10https://gerrit.wikimedia.org/r/810022
[15:33:05] <jynus>	 please anyone shout if you see anything still bad since 15:23
[15:33:05] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:33:17] <jinxer-wm>	 (Emergency syslog message) firing: Alert for device asw-a-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[15:33:46] <jynus>	 logstash fallout probably
[15:34:02] <jynus>	 well, logging pipeline in general
[15:34:09] <elukey>	 jynus: I am talking with Papaul on #dcops too
[15:34:46] <wikibugs>	 10SRE, 10Data-Engineering-Kanban, 10Traffic, 10Data Engineering Planning (Sprint 01): Spike: Investigate creating robust alerts to notify that caching nodes are not sending traffic data - https://phabricator.wikimedia.org/T304651 (10JArguello-WMF)
[15:36:03] <wikibugs>	 (03PS1) 10Cwhite: logstash: increase dlq replicas to one [puppet] - 10https://gerrit.wikimedia.org/r/810026 (https://phabricator.wikimedia.org/T311740)
[15:36:19] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - 10https://gerrit.wikimedia.org/r/810027
[15:36:32] <icinga-wm>	 PROBLEM - Host logstash2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:36:38] <icinga-wm>	 RECOVERY - Host logstash2033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms
[15:36:51] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job es_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:37:47] <jinxer-wm>	 (Emergency syslog message) resolved: Device asw-a-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[15:37:49] <XioNoX>	 asw-a4-codfw 3:37PM  up 13 mins,
[15:37:56] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (31) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, clouddumps1001, clouddumps1002, cloudservices1003, cloudservices1004, db2156, gitlab1001, gitlab1004, gitlab2001, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2
[15:37:56] <icinga-wm>	 nos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[15:38:18] <jynus>	 mmm, 31 hosts is a lot of hosts
[15:38:38] <jynus>	 is that another fallout of the power issue, maybe?
[15:38:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, thank you! See inline" [alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah)
[15:38:44] <XioNoX>	 jynus: a rack is usually 40 devices, and that's without counting VMs
[15:38:55] <jynus>	 no, I mean for the puppet alert
[15:38:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:38:59] <XioNoX>	 ah
[15:39:07] <jynus>	 probably puppet failed and triggered the alert
[15:39:13] <jynus>	 hopefully it will recover
[15:39:23] <zabe>	 some of those hosts are at eqiad
[15:39:26] <XioNoX>	 jynus: some hosts are in eqiad
[15:39:27] <XioNoX>	 eh
[15:39:36] <icinga-wm>	 PROBLEM - IPMI Sensor Status on logstash2033 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:39:38] <jynus>	 yeah, but those may be actual errors
[15:39:59] <jynus>	 some of the codfw ones shouldn't alert (like db2*)
[15:40:20] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mc-gp2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:41:12] <icinga-wm>	 RECOVERY - Host logstash2026 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms
[15:41:48] <jynus>	 I think that was the last host to come up
[15:42:11] <jynus>	 let's wait for maintenance to complete
[15:42:23] <jynus>	 and then we will review whether there is any outstanding issue left
[15:42:24] <icinga-wm>	 RECOVERY - Host logstash2026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.08 ms
[15:43:34] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host stat1010.eqiad.wmnet with OS bullseye
[15:43:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P30675 and previous config saved to /var/cache/conftool/dbconfig/20220630-154335-ladsgroup.json
[15:43:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye completed: - stat1010 (*...
[15:43:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:45] <stashbot>	 T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525
[15:44:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: add keyholder alerting (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah)
[15:45:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) @nskaggs can you confirm the partman recipe you want?
[15:45:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job es_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:46:03] <wikibugs>	 (03PS12) 10Vgutierrez: [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: 10BBlack)
[15:46:05] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Add ESI testing remap rule [puppet] - 10https://gerrit.wikimedia.org/r/810030 (https://phabricator.wikimedia.org/T308799)
[15:46:24] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - 10https://gerrit.wikimedia.org/r/810027
[15:46:26] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: scap: use the new script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/810031
[15:47:38] <wikibugs>	 (03PS2) 10Majavah: add keyholder alerting [alerts] - 10https://gerrit.wikimedia.org/r/810003
[15:49:15] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) @Marostegui you can do all first 8
[15:49:25] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36151/console" [puppet] - 10https://gerrit.wikimedia.org/r/810022 (owner: 10Majavah)
[15:50:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: increase dlq replicas to one [puppet] - 10https://gerrit.wikimedia.org/r/810026 (https://phabricator.wikimedia.org/T311740) (owner: 10Cwhite)
[15:50:33] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10puppet-compiler: pcc-uploader failing on tools-puppetmaster-02 - https://phabricator.wikimedia.org/T311742 (10taavi)
[15:52:16] <wikibugs>	 (03CR) 10Majavah: add keyholder alerting (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah)
[15:52:56] <wikibugs>	 (03PS1) 10Zabe: acme_chief: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/810032 (https://phabricator.wikimedia.org/T308013)
[15:52:58] <wikibugs>	 (03PS1) 10Zabe: certspotter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/810033 (https://phabricator.wikimedia.org/T308013)
[15:53:00] <wikibugs>	 (03PS1) 10Zabe: cumin: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/810034 (https://phabricator.wikimedia.org/T308013)
[15:53:02] <wikibugs>	 (03PS1) 10Zabe: pdns_server: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/810035 (https://phabricator.wikimedia.org/T308013)
[15:53:04] <wikibugs>	 (03PS1) 10Zabe: uwsgi: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/810036 (https://phabricator.wikimedia.org/T308013)
[15:53:36] <wikibugs>	 (03PS1) 10Jcrespo: InteractiveQuery: Add additional cli messages after user testing [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215)
[15:53:52] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/810026 (https://phabricator.wikimedia.org/T311740) (owner: 10Cwhite)
[15:54:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] add keyholder alerting [alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah)
[15:54:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] add keyholder alerting [alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah)
[15:54:45] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] ADD new PDU model to ps1-a4-codfw [puppet] - 10https://gerrit.wikimedia.org/r/809977 (https://phabricator.wikimedia.org/T309957) (owner: 10Papaul)
[15:55:52] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: increase dlq replicas to one [puppet] - 10https://gerrit.wikimedia.org/r/810026 (https://phabricator.wikimedia.org/T311740) (owner: 10Cwhite)
[15:56:17] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/810026 (https://phabricator.wikimedia.org/T311740) (owner: 10Cwhite)
[15:57:43] <wikibugs>	 (03Merged) 10jenkins-bot: add keyholder alerting [alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah)
[15:58:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Just a nit inline, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite)
[15:58:12] <cwhite>	 papaul: going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/809977/
[15:59:03] <wikibugs>	 (03CR) 10Jcrespo: "FYI" [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo)
[15:59:07] <papaul>	  cwhite: yes thanks
[16:00:04] <jouncebot>	 jbond and rzl: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T1600).
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:18] <icinga-wm>	 RECOVERY - Host ps1-a4-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[16:00:22] <wikibugs>	 (03PS4) 10Cwhite: logstash: add loki output support [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826)
[16:00:25] <wikibugs>	 (03PS1) 10Majavah: keyholder::monitoring: remove nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/810039
[16:00:27] <wikibugs>	 (03PS1) 10Majavah: keyholder::monitoring: drop nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810040
[16:00:29] <wikibugs>	 (03PS1) 10Majavah: keyholder::monitoring: drop absented resources [puppet] - 10https://gerrit.wikimedia.org/r/810041
[16:00:50] <wikibugs>	 (03CR) 10Cwhite: logstash: add loki output support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite)
[16:01:36] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp2028 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:02:51] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Marostegui) Brilliant! Thanks
[16:06:28] <wikibugs>	 10SRE, 10DC-Ops, 10Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) a:05RobH→03MoritzMuehlenhoff So I think this is now on Mortiz to roll out the monitoring changes (as he is in the above patchset) and no longer blocked on my testing.  I'm...
[16:07:22] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Enable ESI for /esitest-fa8a495983347898/includer [puppet] - 10https://gerrit.wikimedia.org/r/810044 (https://phabricator.wikimedia.org/T308799)
[16:07:25] <wikibugs>	 10SRE, 10DC-Ops, 10Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10BTullis) There was one more issue to address with these servers, which (thanks once again to @fgiunchedi) we have now identified and overcome.  It was related to the enumeration/ord...
[16:08:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[16:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[16:08:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T309311)', diff saved to https://phabricator.wikimedia.org/P30676 and previous config saved to /var/cache/conftool/dbconfig/20220630-160831-ladsgroup.json
[16:08:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:37] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[16:09:39] <icinga-wm>	 RECOVERY - IPMI Sensor Status on logstash2033 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:10:13] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "Looks good to me, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite)
[16:10:26] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mc-gp2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:11:52] <wikibugs>	 (03PS2) 10Jcrespo: InteractiveQuery: Add additional cli messages after user testing [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215)
[16:12:25] <wikibugs>	 (03PS1) 10Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810045 (https://phabricator.wikimedia.org/T310054)
[16:12:27] <wikibugs>	 (03PS1) 10Jdlrobson: Enable Vector grid on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810046 (https://phabricator.wikimedia.org/T303484)
[16:12:45] <jinxer-wm>	 (Device rebooted) firing: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[16:13:41] <wikibugs>	 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul)
[16:14:04] <wikibugs>	 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul)
[16:14:48] <icinga-wm>	 PROBLEM - SSH on restbase2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:15:55] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - 10https://gerrit.wikimedia.org/r/810027
[16:15:57] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: scap: use the new script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/810031
[16:15:59] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: scap: drop unused parameters from the configuration [puppet] - 10https://gerrit.wikimedia.org/r/810048
[16:17:26] <icinga-wm>	 PROBLEM - Restbase root url on restbase2018 is CRITICAL: connect to address 10.192.48.120 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[16:17:36] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM, please see optional comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite)
[16:17:45] <jinxer-wm>	 (Device rebooted) resolved: Device ps1-a4-codfw.mgmt.codfw.wmnet recovered from Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[16:20:16] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.48.124:9042 on restbase2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[16:20:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki: add scap restarts script [puppet] - 10https://gerrit.wikimedia.org/r/810027 (owner: 10Giuseppe Lavagetto)
[16:21:46] <wikibugs>	 (03PS8) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)
[16:24:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik)
[16:27:35] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10Vgutierrez) @AndyRussG currently in our CDN varnish and ATS runs on the same nodes. All the communication with backend servers/applayer is performed by ats-be (see https...
[16:28:30] <wikibugs>	 (03PS9) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)
[16:28:32] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[16:28:38] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[16:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik)
[16:30:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) Summarizing yesterday's work:  * Rob updated the BIOS (to latest) and the idrac (one step below latest, latest breaks https idrac interface) - NIC still doesn't det...
[16:31:14] <wikibugs>	 (03PS3) 10Jcrespo: InteractiveQuery: Add additional cli messages after user testing [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215)
[16:31:20] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (30) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, clouddumps1001, clouddumps1002, cloudservices1003, cloudservices1004, gitlab1001, gitlab1004, gitlab2001, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, tha
[16:31:20] <icinga-wm>	 02, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[16:32:31] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] P:toolforge::checker: remove stretch endpoints [puppet] - 10https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah)
[16:32:35] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah)
[16:32:48] <wikibugs>	 (03PS5) 10David Caro: P:toolforge::checker: remove stretch endpoints [puppet] - 10https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah)
[16:32:54] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:32:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:36] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:54] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.48.126:9042 on restbase2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[16:36:14] <wikibugs>	 (03CR) 10David Caro: "Just one question, otherwise LGTM (that's a +1 from me if anyone gets to it before me)" [puppet] - 10https://gerrit.wikimedia.org/r/810022 (owner: 10Majavah)
[16:36:22] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] "PCC noop: https://puppet-compiler.wmflabs.org/pcc-worker1001/36152/" [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite)
[16:36:32] <wikibugs>	 (03PS4) 10Jcrespo: InteractiveQuery: Add additional cli messages after user testing [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215)
[16:37:21] <wikibugs>	 (03PS1) 10Sohom Datta: Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098)
[16:40:01] <wikibugs>	 (03CR) 10Sohom Datta: "Needs to be enabled after https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/806272 is merged 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta)
[16:40:23] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] P:toolforge: drop stretch support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810022 (owner: 10Majavah)
[16:40:58] <wikibugs>	 (03PS1) 10Ladsgroup: Set GlobalBlockingAllowedRanges for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810055 (https://phabricator.wikimedia.org/T307648)
[16:41:56] <wikibugs>	 (03PS10) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)
[16:42:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Set GlobalBlockingAllowedRanges for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810055 (https://phabricator.wikimedia.org/T307648) (owner: 10Ladsgroup)
[16:43:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik)
[16:43:58] <wikibugs>	 (03PS11) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)
[16:44:12] <wikibugs>	 (03PS2) 10Ladsgroup: Set GlobalBlockingAllowedRanges for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810055 (https://phabricator.wikimedia.org/T307648)
[16:45:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik)
[16:46:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10wiki_willy)
[16:47:18] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:47:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (10wiki_willy)
[16:52:10] <wikibugs>	 (03PS1) 10Volans: tools/dump: don't dump cluster groups [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/810056
[16:53:17] <wikibugs>	 (03CR) 10Jcrespo: "Some example inputs, as with color things are clearer, I think:" [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo)
[16:53:32] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Self-merging to unblock the dumps." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/810056 (owner: 10Volans)
[16:54:25] <wikibugs>	 (03Merged) 10jenkins-bot: tools/dump: don't dump cluster groups [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/810056 (owner: 10Volans)
[16:55:07] <wikibugs>	 (03PS12) 10Volans: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 (owner: 10Ayounsi)
[16:55:33] <wikibugs>	 (03PS1) 10MSantos: mobileapps: bump to  2022-06-30-114235-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810057
[16:56:08] <wikibugs>	 (03PS13) 10Volans: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 (owner: 10Ayounsi)
[16:56:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10EChetty)
[16:57:20] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] cli: Change logging to log on a different file each [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809589 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo)
[16:57:29] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Prepare for 0.1.3 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809588 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo)
[16:57:38] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] InteractiveQuery: Add additional cli messages after user testing [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo)
[16:59:32] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] mobileapps: bump to  2022-06-30-114235-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810057 (owner: 10MSantos)
[16:59:38] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudvirt1016 - https://phabricator.wikimedia.org/T307825 (10Cmjohnson)
[16:59:46] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudvirt1016 - https://phabricator.wikimedia.org/T307825 (10Cmjohnson) 05Open→03Resolved
[16:59:51] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531 (10Cmjohnson)
[17:00:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T309311)', diff saved to https://phabricator.wikimedia.org/P30678 and previous config saved to /var/cache/conftool/dbconfig/20220630-170016-ladsgroup.json
[17:00:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:23] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[17:01:40] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Solved all the issues with python 3.10. Can be finally merged. Thanks Arzhel for the initial patch." [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 (owner: 10Ayounsi)
[17:03:35] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: bump to  2022-06-30-114235-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810057 (owner: 10MSantos)
[17:04:36] <wikibugs>	 (03Merged) 10jenkins-bot: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 (owner: 10Ayounsi)
[17:05:35] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo)
[17:06:08] <wikibugs>	 (03PS2) 10Majavah: P:toolforge: drop stretch support [puppet] - 10https://gerrit.wikimedia.org/r/810022
[17:06:26] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[17:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:48] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[17:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:54] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36153/console" [puppet] - 10https://gerrit.wikimedia.org/r/810022 (owner: 10Majavah)
[17:07:02] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[17:07:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:22] <wikibugs>	 10SRE, 10ops-eqiad: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (10Cmjohnson) 05Open→03Resolved removed the port
[17:07:50] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[17:07:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:02] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:09:31] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[17:09:31] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo)
[17:09:33] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10Cmjohnson) there are several new servers that have been racked and tasks have not been updated.  These will get updated as soon as possible
[17:09:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:42] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: Spike: Investigate creating robust alerts to notify that caching nodes are not sending traffic data - https://phabricator.wikimedia.org/T304651 (10EChetty)
[17:09:53] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10media-backups, 10Goal, 10Patch-For-Review: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (10jcrespo) 05Open→03Resolved All open questions (or at least basic ones resolved), basically we will do a "b...
[17:10:12] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[17:10:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P30679 and previous config saved to /var/cache/conftool/dbconfig/20220630-171522-ladsgroup.json
[17:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) a:05Jclark-ctr→03RobH John fixed it, just pinged me in IRC.  So I'll steal this back and open a case for the NIC issue.
[17:23:08] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] prometheus: probe DNS for (www).wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi)
[17:23:13] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] prometheus: add initial blackbox dns probes for wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/809535 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi)
[17:25:38] <wikibugs>	 (03PS1) 10Stang: tawikisource: Add English alias for Author/Author_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810060 (https://phabricator.wikimedia.org/T165813)
[17:30:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P30680 and previous config saved to /var/cache/conftool/dbconfig/20220630-173027-ladsgroup.json
[17:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:53] <wikibugs>	 (03CR) 10BCornwall: "Is a wrapper the best way forward for this? I'm normally wary of wrappers because of the risk of complexity and changing tooling from stan" [puppet] - 10https://gerrit.wikimedia.org/r/808984 (owner: 10Jbond)
[17:35:08] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:35:40] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:39:36] <wikibugs>	 (03PS2) 10Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810045 (https://phabricator.wikimedia.org/T310054)
[17:40:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[17:40:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[17:40:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P30681 and previous config saved to /var/cache/conftool/dbconfig/20220630-174043-ladsgroup.json
[17:40:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:49] <stashbot>	 T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525
[17:45:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T309311)', diff saved to https://phabricator.wikimedia.org/P30682 and previous config saved to /var/cache/conftool/dbconfig/20220630-174532-ladsgroup.json
[17:45:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[17:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:39] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[17:45:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[17:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T309311)', diff saved to https://phabricator.wikimedia.org/P30683 and previous config saved to /var/cache/conftool/dbconfig/20220630-174603-ladsgroup.json
[17:46:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:32] <wikibugs>	 (03PS1) 10Eigyan: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759)
[17:49:01] <wikibugs>	 (03PS2) 10Eigyan: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759)
[17:50:50] <wikibugs>	 (03PS1) 10Cwhite: beta-logs: set loki retention to 3d [puppet] - 10https://gerrit.wikimedia.org/r/810064 (https://phabricator.wikimedia.org/T222826)
[17:52:02] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] beta-logs: set loki retention to 3d [puppet] - 10https://gerrit.wikimedia.org/r/810064 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite)
[17:52:06] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:52:56] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] P:toolforge: drop stretch support [puppet] - 10https://gerrit.wikimedia.org/r/810022 (owner: 10Majavah)
[17:54:26] <wikibugs>	 (03PS3) 10Eigyan: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759)
[17:55:35] <wikibugs>	 10SRE, 10ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (10ssingh)
[17:55:42] <wikibugs>	 10SRE, 10ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (10ssingh) p:05Triage→03Low
[18:00:05] <jouncebot>	 dduvall and hashar: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T1800).
[18:00:12] <wikibugs>	 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) >>! In T310980#8040825, @elukey wrote: >>>! In T310980#8040624, @Eevans wrote: >> I would propose that the way to think about this might be to ask ourselves how much runway we...
[18:00:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P30684 and previous config saved to /var/cache/conftool/dbconfig/20220630-180015-ladsgroup.json
[18:00:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:22] <stashbot>	 T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525
[18:01:11] <wikibugs>	 10SRE, 10DNS, 10Fundraising-Backlog, 10Infrastructure-Foundations, and 3 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (10ssingh) p:05Triage→03Medium
[18:01:42] <wikibugs>	 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10ssingh) p:05Triage→03Medium
[18:01:49] <wikibugs>	 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - https://phabricator.wikimedia.org/T311131 (10ssingh) p:05Triage→03Medium
[18:04:14] <wikibugs>	 (03PS1) 10Majavah: wmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067
[18:09:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 (owner: 10Majavah)
[18:10:30] <wikibugs>	 (03PS1) 10Dduvall: all wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810068 (https://phabricator.wikimedia.org/T308071)
[18:10:32] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] all wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810068 (https://phabricator.wikimedia.org/T308071) (owner: 10Dduvall)
[18:11:13] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810068 (https://phabricator.wikimedia.org/T308071) (owner: 10Dduvall)
[18:14:36] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:15:17] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.18  refs T308071
[18:15:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P30685 and previous config saved to /var/cache/conftool/dbconfig/20220630-181520-ladsgroup.json
[18:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:23] <stashbot>	 T308071: 1.39.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T308071
[18:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:16:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:20:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:20:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:21:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:06] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:27:37] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10Cmjohnson) 05Open→03Resolved these have been updated with the msw servers
[18:30:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P30686 and previous config saved to /var/cache/conftool/dbconfig/20220630-183025-ladsgroup.json
[18:30:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:46] <wikibugs>	 (03CR) 10BCornwall: spdx: Add csr files to the list of files to ignore. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/808219 (owner: 10Jbond)
[18:45:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P30687 and previous config saved to /var/cache/conftool/dbconfig/20220630-184530-ladsgroup.json
[18:45:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:38] <stashbot>	 T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525
[18:47:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T309311)', diff saved to https://phabricator.wikimedia.org/P30688 and previous config saved to /var/cache/conftool/dbconfig/20220630-184708-ladsgroup.json
[18:47:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:14] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[18:48:34] <wikibugs>	 (03CR) 10Majavah: mediawiki: Split updateSpecialPages.php job to be per-shard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: 10Legoktm)
[18:55:17] <wikibugs>	 (03PS3) 10Dzahn: mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738)
[19:02:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P30689 and previous config saved to /var/cache/conftool/dbconfig/20220630-190213-ladsgroup.json
[19:02:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10Jgreen)
[19:05:01] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Phabricator: Remove unneeded translation overrides [puppet] - 10https://gerrit.wikimedia.org/r/809907 (https://phabricator.wikimedia.org/T309746) (owner: 10Aklapper)
[19:09:44] <wikibugs>	 (03PS1) 10Stang: RecentChange: Straight join to actor table when needed [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809959 (https://phabricator.wikimedia.org/T311360)
[19:14:58] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:16:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] admin: allow sudo for jclark-ctr for cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/809338 (https://phabricator.wikimedia.org/T306654) (owner: 10Ssingh)
[19:17:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P30690 and previous config saved to /var/cache/conftool/dbconfig/20220630-191718-ladsgroup.json
[19:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:28] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Arlolra)
[19:21:56] <wikibugs>	 (03PS2) 10Andrew Bogott: wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666)
[19:22:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) (owner: 10Andrew Bogott)
[19:24:01] <wikibugs>	 (03PS1) 10Dzahn: httpbb: add tests for policy.wikimedia.org, fixcopyright.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/810073 (https://phabricator.wikimedia.org/T310738)
[19:24:58] <wikibugs>	 (03CR) 10Dzahn: "test for this here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/810073" [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn)
[19:25:29] <wikibugs>	 (03PS3) 10Andrew Bogott: wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666)
[19:27:02] <wikibugs>	 (03CR) 10Ottomata: Add a new Eventgate stream for revision-score events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[19:27:45] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul)
[19:32:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T309311)', diff saved to https://phabricator.wikimedia.org/P30691 and previous config saved to /var/cache/conftool/dbconfig/20220630-193223-ladsgroup.json
[19:32:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[19:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:30] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[19:32:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[19:32:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T309311)', diff saved to https://phabricator.wikimedia.org/P30692 and previous config saved to /var/cache/conftool/dbconfig/20220630-193254-ladsgroup.json
[19:33:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:40:32] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[19:42:48] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:53:55] <wikibugs>	 (03CR) 10RLazarus: "> So I am actually not 100% sure which should go first." [puppet] - 10https://gerrit.wikimedia.org/r/810073 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn)
[19:54:34] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:56:03] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810077 (https://phabricator.wikimedia.org/T128546)
[19:56:56] <wikibugs>	 (03PS4) 10Dzahn: mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738)
[19:57:50] <wikibugs>	 (03CR) 10Dzahn: httpbb: add tests for policy.wikimedia.org, fixcopyright.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810073 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn)
[19:57:59] <wikibugs>	 (03Abandoned) 10Dzahn: httpbb: add tests for policy.wikimedia.org, fixcopyright.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/810073 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn)
[20:00:04] <jouncebot>	 brennen: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T2000).
[20:00:04] <jouncebot>	 danisztls, kart_, Jdlrobson, koi, jan_drewniak, and eigyan: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] <danisztls>	 greetings
[20:00:18] * kart_ is here
[20:00:28] <eigyan>	 0/
[20:00:35] <eigyan>	 o/
[20:00:36] <koi>	 o/
[20:00:45] <jan_drewniak>	 0/ ( I can do mine)
[20:00:46] <eigyan>	 Greetings
[20:00:58] * kart_ will also self-deploy
[20:01:16] * urbanecm waves
[20:01:25] <Jdlrobson>	 o/
[20:01:56] <thcipriani>	 howdy all
[20:03:13] <urbanecm>	 hi thcipriani. i'm around if any help with the window's needed :)
[20:03:22] <thcipriani>	 thanks urbanecm 
[20:03:29] <kart_>	 thcipriani: I think we need to fix comment about Beta feature permission in wmf-config/InitialiseSettings.php#16976
[20:03:32] <wikibugs>	 (03PS12) 10Thcipriani: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[20:04:33] <wikibugs>	 (03PS1) 10BryanDavis: toolhub: Bump container version to 2022-06-30-170012-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810079 (https://phabricator.wikimedia.org/T303444)
[20:04:47] * TheresNoTime is also around to help if the bottom of the barrel needs scraping :D
[20:05:08] <kart_>	 That reminds me to fix Beta Feature comment about cx. Last updated in 2019! :)
[20:05:36] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Update analytics refine job version in test cluster [puppet] - 10https://gerrit.wikimedia.org/r/787718 (owner: 10Aqu)
[20:05:47] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs __init__.py: don't specify json_output when calling run_formatted_as [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810080
[20:05:51] <thcipriani>	 kart_: which patch?
[20:06:05] <kart_>	 thcipriani: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/809165
[20:07:51] <thcipriani>	 kart_: oh, yes that :D
[20:08:04] <thcipriani>	 I'll file a task for that after this 
[20:08:24] <kart_>	 thcipriani: Thanks!
[20:09:20] <stephanebisson>	 kart_ o/
[20:10:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs __init__.py: don't specify json_output when calling run_formatted_as [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810080 (owner: 10Andrew Bogott)
[20:11:46] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2022-06-30-170012-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810079 (https://phabricator.wikimedia.org/T303444) (owner: 10BryanDavis)
[20:12:33] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[20:13:50] <wikibugs>	 (03Merged) 10jenkins-bot: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[20:14:24] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:806960]] QuickSurveys: Deploy research-incentive to jawiki
[20:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:40] <wikibugs>	 10SRE, 10Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (10BCornwall) @Vgutierrez Indeed, do you have any reason to keep these *specific* instances around, or are you okay with a replacement?
[20:15:01] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: Bump container version to 2022-06-30-170012-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810079 (https://phabricator.wikimedia.org/T303444) (owner: 10BryanDavis)
[20:15:25] <danisztls>	 thanks, thcipriani
[20:16:14] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply
[20:16:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:27] <urbanecm>	 thcipriani: pardon my ignorance, but i'm curious why a full scap for a config change? :-) is it that quick these days?
[20:17:08] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply
[20:17:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:23] <kart_>	 Oh, I was about to ask when log showed 'started scap..' :)
[20:17:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:21] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply
[20:18:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:39] <wikibugs>	 (03CR) 10Jsn.sherman: [C: 03+1] "LGTM! I can see now why we were not producing the expected stay previously. Nice work!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) (owner: 10Eigyan)
[20:18:44] <thcipriani>	 urbanecm: heh, yeah, we're trying to start using a full scap more and we're testing a new scap command (not yet ready for primetime) called "scap backport" so I typed "scap backport 806960" and it merged the change, staged it, and started a sync (although it was supposed to stage it on mwdebug first :D)
[20:19:08] <thcipriani>	 oh wait: it did!
[20:19:16] <urbanecm>	 interesting initiative :)
[20:19:18] <wikibugs>	 (03PS3) 10Sbisson: Enable Wikistories on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143)
[20:19:35] <thcipriani>	 danisztls: your change is on mwdebug1002, check please!
[20:19:40] <kart_>	 thcipriani: that's cool.
[20:19:50] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[20:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:56] * bd808 still dreams of 100% automated, hands-free CD
[20:20:00] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:20:27] <wikibugs>	 (03PS4) 10Eigyan: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759)
[20:20:53] <wikibugs>	 (03CR) 10KartikMistry: "Relend will update comment about Beta feature permission." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: 10Sbisson)
[20:21:08] <kart_>	 Typo :/
[20:21:11] <brennen>	 bd808: these are steps in that direction
[20:21:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:21:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:22:00] <bd808>	 brennen: *nod* mw-on-k8s seems like a good poke to push that way
[20:22:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:35] <danisztls>	 thcipriani: the 'enabled' flag was set to false, can I do a follow-up?
[20:23:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:23:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:27] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[20:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:39] <wikibugs>	 (03CR) 10DDesouza: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[20:23:39] <thcipriani>	 danisztls: I suppose although we've got a lot of patches in this window. Is this fine to sync? Or should I revert?
[20:24:04] <danisztls>	 thcipriani: it's fine to sync
[20:24:17] <danisztls>	 thcipriani: it will only be disabled
[20:24:25] <thcipriani>	 ah, ok, going live.
[20:24:48] <wikibugs>	 (03PS1) 10Dzahn: vtrs: add promtheus blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/810087
[20:24:50] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[20:24:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:30] <bd808>	 !log Rebuilding Toolhub search indices
[20:26:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:38] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Enable Wikistories on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: 10Sbisson)
[20:29:25] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Wikistories on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: 10Sbisson)
[20:30:33] <kart_>	 thcipriani: You are deploying https://gerrit.wikimedia.org/r/809165, right?
[20:30:44] <kart_>	 stephanebisson will test it :)
[20:31:02] <thcipriani>	 kart_: cool, I just merged that one, still need to stage it
[20:33:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:33:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:17] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) 05Open→03Resolved a:03Krinkle Any remaining "smells like opcache" problems we see can't be the cause of php-opcache revalidation m...
[20:35:35] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] RecentChange: Straight join to actor table when needed [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809959 (https://phabricator.wikimedia.org/T311360) (owner: 10Stang)
[20:36:39] <Jdlrobson>	 thcipriani: FYI my first patch is beta cluster only if you want to hit +2 on that now
[20:37:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:37:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:37:10] <thcipriani>	 Jdlrobson: ah, neat, thanks for the poke that'll help :)
[20:37:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:14] <wikibugs>	 (03CR) 10RLazarus: "LGTM in principle. One bug in the tests but then it will be ready to go, and the rollout plan in the commit message sounds reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn)
[20:37:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:27] <Jdlrobson>	 (https://gerrit.wikimedia.org/r/c/810046/)
[20:37:36] <Jdlrobson>	 since the window is a bit packed :)
[20:37:59] <thcipriani>	 ah, but it looks like it has a dependency chain that gerrit isn't happy about when I try to rebase :\
[20:38:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:38:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:18] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:806960]] QuickSurveys: Deploy research-incentive to jawiki (duration: 23m 53s)
[20:38:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:21] <thcipriani>	 stephanebisson: your change is on mwdebug1002, check please
[20:39:28] <stephanebisson>	 thcipriani, kart_: I'm on it, will need a good 5 minutes
[20:39:50] <Jdlrobson>	 thcipriani: ahh my bad
[20:40:13] <Jdlrobson>	 okay.. well hopefully the first config change will go quickly
[20:40:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T309311)', diff saved to https://phabricator.wikimedia.org/P30693 and previous config saved to /var/cache/conftool/dbconfig/20220630-204043-ladsgroup.json
[20:40:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:49] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[20:41:40] <wikibugs>	 (03PS3) 10Thcipriani: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810045 (https://phabricator.wikimedia.org/T310054) (owner: 10Jdlrobson)
[20:41:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:41:54] <kart_>	 stephanebisson: no problem. Beta feature part is activated.
[20:42:12] <wikibugs>	 (03PS3) 10DDesouza: QuickSurveys: Enable 'research-incentive' survey on 'jawiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015)
[20:42:58] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810045 (https://phabricator.wikimedia.org/T310054) (owner: 10Jdlrobson)
[20:45:04] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:45:24] <wikibugs>	 (03Merged) 10jenkins-bot: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810045 (https://phabricator.wikimedia.org/T310054) (owner: 10Jdlrobson)
[20:46:19] <wikibugs>	 (03PS6) 10Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah)
[20:48:15] <wikibugs>	 (03PS4) 10DDesouza: QuickSurveys: Enable 'research-incentive' survey on 'jawiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015)
[20:48:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:48:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:49:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:49:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:55] <urandom>	 Is there someone around with the time to look at restbase2018?  It seems...down(ish), and I can't get in via ssh.
[20:50:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:25] <urandom>	 To be clear:  It's better up than down, but it's not an emergency :)
[20:51:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah)
[20:51:47] <urandom>	 The outage length is coming up on the hint window though, so bringing it up reduces the likelihood of any replica loss. 
[20:52:12] <stephanebisson>	 thcipriani all good, please sync
[20:55:05] <thcipriani>	 stephanebisson: thanks for checking, going now
[20:55:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P30694 and previous config saved to /var/cache/conftool/dbconfig/20220630-205548-ladsgroup.json
[20:55:50] <icinga-wm>	 PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:43] <wikibugs>	 (03Merged) 10jenkins-bot: RecentChange: Straight join to actor table when needed [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809959 (https://phabricator.wikimedia.org/T311360) (owner: 10Stang)
[20:59:05] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:809165|Enable Wikistories on idwiki (T311143)]] (duration: 03m 31s)
[20:59:06] <thcipriani>	 Jdlrobson: your second change is on mwdebug1002, check please
[20:59:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:11] <stashbot>	 T311143: Deploy Wikistories to production - https://phabricator.wikimedia.org/T311143
[20:59:14] <thcipriani>	 stephanebisson: your change should be live now
[20:59:26] <Jdlrobson>	 testing now...
[20:59:28] <wikibugs>	 (03PS2) 10Thcipriani: Enable Vector grid on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810046 (https://phabricator.wikimedia.org/T303484) (owner: 10Jdlrobson)
[20:59:33] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Enable Vector grid on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810046 (https://phabricator.wikimedia.org/T303484) (owner: 10Jdlrobson)
[20:59:49] <stephanebisson>	 thcipriani Thanks!
[21:00:35] <Jdlrobson>	 thcipriani: it looks like my expression is wrong.. sigh
[21:01:35] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Vector grid on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810046 (https://phabricator.wikimedia.org/T303484) (owner: 10Jdlrobson)
[21:02:03] <thcipriani>	 bummer :(
[21:02:14] <thcipriani>	 koi: your wmf.18 change is on mwdebug1002, check please
[21:03:07] <koi>	 looking
[21:05:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:05:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:06:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:06:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:02] <koi>	 thcipriani: pretty sad, the issue still exists, let's revert it
[21:07:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:07:31] <thcipriani>	 koi: :( ok
[21:07:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:36] <thcipriani>	 thanks for checking
[21:07:45] <wikibugs>	 (03PS1) 10Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054)
[21:07:50] <Jdlrobson>	 thcipriani: i am doing it the old-fashioned way..
[21:08:10] <Jdlrobson>	 the dblist expressions don't seem to work how I think they work
[21:08:16] <wikibugs>	 (03PS1) 10Thcipriani: Revert "RecentChange: Straight join to actor table when needed" [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809962
[21:08:25] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Revert "RecentChange: Straight join to actor table when needed" [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809962 (owner: 10Thcipriani)
[21:09:14] <koi>	 sorry for another twenty minutes of waiting 0 0
[21:10:34] <TheresNoTime>	 koi: damn :/
[21:10:49] <thcipriani>	 koi: no worries we can sync the other stuff while we're waiting on this backport: no big deal <3
[21:10:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P30695 and previous config saved to /var/cache/conftool/dbconfig/20220630-211053-ladsgroup.json
[21:10:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:05] <urbanecm>	 Jdlrobson: i didn't follow what happened closely, but if it just didn't work at all, i think that's because https://github.com/wikimedia/operations-mediawiki-config/blob/master/multiversion/MWConfigCacheGenerator.php#L27 wasn't updated AFAICS
[21:11:57] <Jdlrobson>	 urbanecm: thanks. This list is small enough I think it's fine to not use dblist expression. I just regret not doing that from the start now
[21:12:38] <urbanecm>	 Jdlrobson: yeah, for sure. enumerating wikis is generally preferred over dblists, so this is certainly better. was just for your information :)
[21:12:52] <Jdlrobson>	 <3 appreciate it urbanecm 
[21:12:52] <stephanebisson>	 thcipriani somehow my patch doesn't seem to be sync'd everywhere. When I refresh, sometimes the code is there sometimes it isn't. Is there a long replication delay or could there be a problem?
[21:15:22] <RhinosF1>	 There's been trouble with things not fully syncing the last few days
[21:15:31] <urbanecm>	 unfortunately :/
[21:16:15] <thcipriani>	 fun
[21:16:31] <brennen>	 RhinosF1 / urbanecm - know of that being tracked anywhere?
[21:16:56] <thcipriani>	 Jdlrobson: I'm going to revert yours for now so I can clear out the window
[21:16:56] <urbanecm>	 stephanebisson: would you mind sharing which mw server works correctly and which one doesn't? should be available in the `server` header in your devtools
[21:17:06] <RhinosF1>	 brennen: I do not
[21:17:10] <Jdlrobson>	 thcipriani: hang on
[21:17:14] <stephanebisson>	 urbanecm I'll try...
[21:17:19] <urbanecm>	 brennen: not 100% sure. dancy might know, as they helped with debugging it the other day.
[21:17:22] <Jdlrobson>	 thcipriani: mine's time-sensitive so it might be better if it just goes out as is
[21:17:33] <Jdlrobson>	 it's basically enabling it on more wikis than it should
[21:17:42] <thcipriani>	 what wikis shouldn't be there?
[21:17:46] <Jdlrobson>	 french wikipedia
[21:18:10] <stephanebisson>	 urbanecm mw1414.eqiad.wmnet (not updated)
[21:18:14] <Jdlrobson>	 ideally it wouldn't go out there but the purpose of this change was to get it in front of gadget developer eyes
[21:18:17] <thcipriani>	 I don't see that in the dblist file? I see frwikiquote and frwiktionary
[21:18:23] <Jdlrobson>	 yeh that's the problem :)
[21:18:24] <zabe>	 mw1414 already made problems the other day
[21:18:30] <Jdlrobson>	 the dblist did the inverse of what i wanted
[21:18:47] <stephanebisson>	 urbanecm mw1413.eqiad.wmnet (updated)
[21:18:57] <RhinosF1>	 zabe: that's interesting
[21:19:02] <urbanecm>	 stephanebisson: thanks, that's helpful. mw1414 and 
[21:19:02] <Jdlrobson>	 I think you are not going to have a clean revert path because of the beta cluster change as well
[21:19:10] <Jdlrobson>	 sorry for the mess :(
[21:19:32] <thcipriani>	 Jdlrobson: do you need to flip the false and the true in IS.php then?
[21:19:35] <Jdlrobson>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/810099
[21:19:42] <Jdlrobson>	 that fixes the problem by getting rid of the dblist altogether
[21:20:09] <Jdlrobson>	 If the window needs to finish and I'm allowed I could see if cjming can help deal with this later today
[21:20:26] <icinga-wm>	 RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:20:28] <stephanebisson>	 urbanecm mw1369 also not up to date
[21:20:38] <urbanecm>	 thanks
[21:21:00] <stephanebisson>	 urbanecm anything we can do to fully sync?
[21:21:04] <dancy>	 Hmmm 
[21:21:25] <wikibugs>	 (03PS2) 10Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054)
[21:21:26] <urbanecm>	 yeah. i just want to confirm it's the same thing that was observed yesterday
[21:21:30] <dancy>	 scap sync-wikiversions 
[21:21:37] <wikibugs>	 (03PS3) 10Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054)
[21:21:43] <dancy>	 That's what I used yesterday.
[21:21:55] <cjming>	 Jdlrobson: i'm around - i think Tyler is reverting your config now
[21:22:10] <urbanecm>	 and yes, both mw1414 and 1413 do have the new code, but the web server didn't pick it up for some reason.
[21:22:45] <zabe>	 (the other day it was at least mw1414, mw1415, mw1416, mw1417, mw1418, mw1447 and mw1450)
[21:22:48] <urbanecm>	 thcipriani: I'm not going to step on your toes too much, so i'll leave the blank scap sync-wikiversions (to mitigate the issue) to you :). meanwhile i'll phabricatorize it.
[21:23:10] <dancy>	 Thank you.
[21:23:18] <RhinosF1>	 I'm going to go to sleep because I think far more capable hands are here
[21:23:24] <brennen>	 thanks urbanecm 
[21:23:28] <urbanecm>	 good night RhinosF1 
[21:23:35] <RhinosF1>	 Night urbanecm
[21:24:06] <dancy>	 To me this means that php-fpm restart isn't hitting all of the necessary hosts
[21:24:28] <urbanecm>	 or it is, but the script itself doesn't work (also plausible)
[21:24:31] <wikibugs>	 (03PS1) 10Thcipriani: Revert "Enable Vector grid on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810101
[21:24:33] <wikibugs>	 (03PS1) 10Thcipriani: Revert "Vector: Deploy title above tabs to all opt-in wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810102
[21:24:34] <urbanecm>	 doesn't *always work
[21:24:52] <dancy>	 Agreed 
[21:25:06] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Revert "Enable Vector grid on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810101 (owner: 10Thcipriani)
[21:25:11] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Revert "Vector: Deploy title above tabs to all opt-in wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810102 (owner: 10Thcipriani)
[21:25:16] <dancy>	 Heading back to my desk to investigate 
[21:25:22] <wikibugs>	 (03Abandoned) 10Andrew Bogott: wmcs __init__.py: don't specify json_output when calling run_formatted_as [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810080 (owner: 10Andrew Bogott)
[21:25:50] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Enable Vector grid on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810101 (owner: 10Thcipriani)
[21:25:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T309311)', diff saved to https://phabricator.wikimedia.org/P30696 and previous config saved to /var/cache/conftool/dbconfig/20220630-212558-ladsgroup.json
[21:25:59] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Vector: Deploy title above tabs to all opt-in wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810102 (owner: 10Thcipriani)
[21:26:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:05] <wikibugs>	 (03PS2) 10Thcipriani: tawikisource: Add English alias for Author/Author_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810060 (https://phabricator.wikimedia.org/T165813) (owner: 10Stang)
[21:26:05] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[21:26:09] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] tawikisource: Add English alias for Author/Author_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810060 (https://phabricator.wikimedia.org/T165813) (owner: 10Stang)
[21:26:21] <wikibugs>	 (03PS4) 10Andrew Bogott: wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666)
[21:26:23] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs-makedomain: forward to python3 [puppet] - 10https://gerrit.wikimedia.org/r/810103
[21:27:01] <wikibugs>	 (03Merged) 10jenkins-bot: tawikisource: Add English alias for Author/Author_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810060 (https://phabricator.wikimedia.org/T165813) (owner: 10Stang)
[21:27:33] <thcipriani>	 koi: your wmf-config change is live on mwdebug1002
[21:27:44] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "RecentChange: Straight join to actor table when needed" [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809962 (owner: 10Thcipriani)
[21:27:50] <koi>	 looking
[21:28:24] <stephanebisson>	 thcipriani do you plan on running the sync mentioned above to make sure my config change syncs everywhere?
[21:28:59] <koi>	 thcipriani: LGTM
[21:29:08] <thcipriani>	 stephanebisson: sure, this will go live with koi's change
[21:29:31] <stephanebisson>	 thcipriani thanks
[21:29:58] <urbanecm>	 brennen: dancy: thcipriani: stephanebisson: fyi: i phabricatorized the issue as https://phabricator.wikimedia.org/T311788. 
[21:30:08] <brennen>	 thx
[21:31:19] <urbanecm>	 not sure if we should send sth like "sync everything twice" to ops-l, or if fixing it would be quick. 
[21:31:49] <thcipriani>	 dancy: FWIW it's restarting 307 + 9 hosts which I note is different than the 348 + 9 it syncs to :)
[21:31:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-makedomain: forward to python3 [puppet] - 10https://gerrit.wikimedia.org/r/810103 (owner: 10Andrew Bogott)
[21:32:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:32:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:33:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:55] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810060|tawikisource: Add English alias for Author/Author_talk namespace (T165813)]] (duration: 03m 42s)
[21:33:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:04] <stashbot>	 T165813: Create Author: namespace on Tamil wikisource - https://phabricator.wikimedia.org/T165813
[21:34:15] <thcipriani>	 ^ koi and stephanebisson should be live now
[21:34:16] <dancy>	 back
[21:34:24] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:34:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:34:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:08] <stephanebisson>	 thcipriani looks good, thanks again
[21:35:18] <dancy>	 thcipriani: There were no complaints from the restart script during the deployment?
[21:35:37] <thcipriani>	 stephanebisson: that's good, yw
[21:35:40] <thcipriani>	 dancy: nope
[21:35:41] <koi>	 thcipriani: thanks, one more thing: would you like to run namespaceDupes.php on tawikisource, as mentioned in T165813?
[21:35:45] <dancy>	 gah
[21:35:50] <thcipriani>	 koi: sure
[21:36:51] <thcipriani>	 koi: blerg, looks like there are a few manual fixes needed here
[21:37:34] <thcipriani>	 koi: https://phabricator.wikimedia.org/P30697
[21:37:47] <dancy>	 I'd like a copy of the deployment transcript 
[21:37:58] <urbanecm>	 thcipriani: can you run it with something like --add-prefix=BROKEN (so it can be resolved on wiki)? :)
[21:38:02] <wikibugs>	 (PS1) Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS [puppet] - https://gerrit.wikimedia.org/r/810106
[21:38:40] <wikibugs>	 (PS1) Andrew Bogott: Change formatting of a few openstack calls [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810107
[21:38:44] <thcipriani>	 urbanecm: that one is news to me! Will it only do it with conflicts? (/me hopes)
[21:38:55] <urbanecm>	 yeah. it's like a backup plan :)
[21:39:01] <Jdlrobson>	 thcipriani: i'll follow up with Clare regarding my patches. Sorry to add to the drama today!
[21:39:09] <thcipriani>	 Jdlrobson: <3
[21:39:12] <wikibugs>	 (PS7) Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[21:39:29] <thcipriani>	 urbanecm: TIL, I'll do that
[21:39:35] <urbanecm>	 👍
[21:39:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:43] <wikibugs>	 (PS2) Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS [puppet] - https://gerrit.wikimedia.org/r/810106
[21:40:45] <wikibugs>	 (CR) Andrew Bogott: [C: +2] striker: connect docker container directly to host network [puppet] - https://gerrit.wikimedia.org/r/809714 (https://phabricator.wikimedia.org/T306469) (owner: BryanDavis)
[21:40:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:40:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:40:48] <wikibugs>	 (CR) CI reject: [V: -1] Add PHP 7.4 dependencies for LibreNMS [puppet] - https://gerrit.wikimedia.org/r/810106 (owner: Andrea Denisse)
[21:40:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:52] <thcipriani>	 thanks urbanecm -- koi all done, I'll get a paste on that ticket
[21:40:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:41:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:14] <wikibugs>	 (PS2) Thcipriani: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/810077 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[21:42:20] <koi>	 thanks a lot!
[21:42:25] <wikibugs>	 (CR) Thcipriani: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/810077 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[21:43:34] <wikibugs>	 (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/810077 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[21:43:38] <wikibugs>	 (CR) CI reject: [V: -1] Add PHP 7.4 dependencies for LibreNMS [puppet] - https://gerrit.wikimedia.org/r/810106 (owner: Andrea Denisse)
[21:44:06] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[21:44:11] <wikibugs>	 (CR) CI reject: [V: -1] Change formatting of a few openstack calls [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810107 (owner: Andrew Bogott)
[21:45:33] <thcipriani>	 dancy: here was the output of the deploy that evidently didn't go everywhere: https://phabricator.wikimedia.org/P30698
[21:46:29] <dancy>	 thx
[21:46:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:46:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:05] <thcipriani>	 jan_drewniak: still using portals/sync-portals, correct?
[21:47:30] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:47:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:47:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:47:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:30] <jan_drewniak>	 thcipriani: yeah thanks
[21:48:35] <thcipriani>	 jan_drewniak: it's live on mwdebug1002 if there are things you need to check?
[21:48:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:48:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:56] <wikibugs>	 (PS5) Thcipriani: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) (owner: Eigyan)
[21:48:58] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:49:28] <wikibugs>	 (CR) Thcipriani: [C: +2] [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) (owner: Eigyan)
[21:49:33] <jan_drewniak>	 thcipriani: ok looks good
[21:49:44] <thcipriani>	 jan_drewniak: cool, thanks for checking, running sync-portals now
[21:50:18] <wikibugs>	 (Merged) jenkins-bot: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) (owner: Eigyan)
[21:52:15] <wikibugs>	 (PS1) Jdlrobson: Enable grid on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/810109 (https://phabricator.wikimedia.org/T303484)
[21:53:16] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized portals/wikipedia.org/assets: Config: [[gerrit:810077|Bumping portals to master (T128546)]] (duration: 03m 24s)
[21:53:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:28] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[21:53:36] <wikibugs>	 (PS4) Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054)
[21:53:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:53:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:54:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:55:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:51] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized portals: Config: [[gerrit:810077|Bumping portals to master (T128546)]] (duration: 03m 34s)
[21:56:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:59] <thcipriani>	 ^ jan_drewniak all done!
[21:57:06] <thcipriani>	 eigyan: still around?
[21:57:15] <eigyan>	 hes inideed!
[21:57:18] <eigyan>	 yes
[21:57:40] <eigyan>	 sorry got so excited to type...lol
[21:58:00] <thcipriani>	 :D
[21:58:22] <thcipriani>	 eigyan: your change is live on mwdebug1002, check please
[21:58:29] <eigyan>	 Thank you thcipriani I will check
[21:58:54] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:00:49] <eigyan>	 thcipriani surveys are working as expected. Thank you so much for enduring this late deploy for us all!
[22:01:07] <thcipriani>	 eigyan: glad to hear it, going live
[22:01:13] <wikibugs>	 (PS5) Dzahn: mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738)
[22:01:14] <eigyan>	 Excellent!
[22:01:29] <wikibugs>	 (CR) Dzahn: mediawiki: redirect policy and related sites to wikimediafoundation.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: Dzahn)
[22:01:37] <wikibugs>	 (CR) Dzahn: "fixed yaml trap" [puppet] - https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: Dzahn)
[22:03:34] <wikibugs>	 (CR) RLazarus: [C: +1] mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: Dzahn)
[22:05:15] <wikibugs>	 SRE, Znuny, serviceops, serviceops-collab, Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (Dzahn)
[22:06:25] <wikibugs>	 (PS2) Dzahn: vtrs: add promtheus blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190)
[22:06:33] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810063|[wmf-config]: Deploy GDI Survey 2 on EN and FA wikis (T311759)]] (duration: 03m 16s)
[22:06:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:40] <stashbot>	 T311759: Deploy GDI Safety Survey Wave 2 on EN and FA wikis  - https://phabricator.wikimedia.org/T311759
[22:06:51] <thcipriani>	 ^ eigyan should be live now!
[22:06:54] <thcipriani>	 kudos
[22:07:11] <eigyan>	 Awesome many thanks to you thcipriani!
[22:08:01] <wikibugs>	 (CR) Clare Ming: [C: +2] Enable grid on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/810109 (https://phabricator.wikimedia.org/T303484) (owner: Jdlrobson)
[22:08:27] <cjming>	 fyi - just doing a few more backports before closing this window
[22:08:50] <wikibugs>	 (Merged) jenkins-bot: Enable grid on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/810109 (https://phabricator.wikimedia.org/T303484) (owner: Jdlrobson)
[22:10:01] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops, serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn)
[22:10:21] <wikibugs>	 (CR) Clare Ming: [C: +2] Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054) (owner: Jdlrobson)
[22:11:18] <wikibugs>	 (Merged) jenkins-bot: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054) (owner: Jdlrobson)
[22:11:20] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops, serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) @Krinkle and I agreed on doing this tomorrow at 14:00 PST
[22:11:45] <wikibugs>	 SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (Dzahn)
[22:11:47] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops, serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) Open→In progress
[22:13:27] <cjming>	 Jdlrobson: config change is up on mwdebug1002 if you want to verify
[22:13:52] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:810109|Enable grid on beta cluster (T303484)]] (duration: 03m 43s)
[22:13:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:59] <stashbot>	 T303484: Introduce basic grid system to modern Vector - https://phabricator.wikimedia.org/T303484
[22:14:36] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:14:47] <danisztls>	 cjming: can you do 809961? if not no problem
[22:15:14] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:15:29] <cjming>	 danisztls: can you add to deployment cal? i can do it - that's just enabling the survey right?
[22:16:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:16:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:42] <danisztls>	 cjming: done, thanks!
[22:16:50] <danisztls>	 yes, just enabling it
[22:17:04] <cjming>	 np
[22:17:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:17:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:17:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:02] <Jdlrobson>	 sync away!
[22:19:08] <cjming>	 syncing!
[22:19:43] <wikibugs>	 (PS5) Clare Ming: QuickSurveys: Enable 'research-incentive' survey on 'jawiki' [mediawiki-config] - https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[22:20:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:53] <wikibugs>	 (CR) Clare Ming: [C: +2] QuickSurveys: Enable 'research-incentive' survey on 'jawiki' [mediawiki-config] - https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[22:22:40] <wikibugs>	 (Merged) jenkins-bot: QuickSurveys: Enable 'research-incentive' survey on 'jawiki' [mediawiki-config] - https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[22:23:03] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810099|Vector: Deploy title above tabs to all opt-in wikis (T310054)]] (duration: 03m 36s)
[22:23:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:09] <stashbot>	 T310054: Deploy new toolbar order - https://phabricator.wikimedia.org/T310054
[22:23:15] <cjming>	 Jdlrobson: ^^ live
[22:24:12] <cjming>	 danisztls: can you see survey on mwdebug1002?
[22:24:42] <wikibugs>	 (CR) Volans: "reply inline" [puppet] - https://gerrit.wikimedia.org/r/808984 (owner: Jbond)
[22:24:43] <danisztls>	 cjming: yes, lgtm
[22:24:50] <cjming>	 cool - going live then
[22:25:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:25:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:26:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:26:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:27:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:26] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[22:28:45] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:809961|QuickSurveys: Enable 'research-incentive' survey on 'jawiki' (T311015)]] (duration: 03m 40s)
[22:28:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:51] <stashbot>	 T311015: Deploy QuickSurvey on Japanese Wikipedia - https://phabricator.wikimedia.org/T311015
[22:28:56] <wikibugs>	 (PS1) BryanDavis: striker: Bump container version to 2022-06-29-004157-production [puppet] - https://gerrit.wikimedia.org/r/810118
[22:29:06] <cjming>	 danisztls: survey should be live
[22:30:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[22:30:52] <danisztls>	 does it take a while to sync?
[22:30:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:18] <cjming>	 danisztls: it's done syncing - should be live now
[22:32:03] * dancy eyes
[22:34:34] <danisztls>	 cjming: it's showing the survey on mwdebug but not on prod
[22:34:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:34:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:06] * dancy shakes a fist
[22:36:21] <dancy>	 well... perhaps I should reserve fist shaking until it's confirmed that this is the same problem. :-)
[22:36:35] <danisztls>	 it's showing now
[22:36:38] <dancy>	 whew!
[22:36:39] <cjming>	 oh good
[22:36:43] * dancy unshakes fist
[22:36:45] <cjming>	 lol
[22:37:02] <danisztls>	 I'm curious about the cause now
[22:37:14] <danisztls>	 Thanks cjming 
[22:37:17] <cjming>	 np!
[22:37:24] <cjming>	 closing the window at long last
[22:39:00] <cjming>	 !log end of UTC late backport and config training window
[22:39:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:22] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[22:53:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2161.mgmt.codfw.wmnet with reboot policy FORCED
[22:53:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2162.mgmt.codfw.wmnet with reboot policy FORCED
[22:54:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:59] <wikibugs>	 (CR) Andrew Bogott: [C: +2] striker: Bump container version to 2022-06-29-004157-production [puppet] - https://gerrit.wikimedia.org/r/810118 (owner: BryanDavis)
[23:39:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:39:43] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2162.mgmt.codfw.wmnet with reboot policy FORCED
[23:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2161.mgmt.codfw.wmnet with reboot policy FORCED
[23:42:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:28] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:42:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2163.mgmt.codfw.wmnet with reboot policy FORCED
[23:42:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:00] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2164.mgmt.codfw.wmnet with reboot policy FORCED
[23:43:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:16] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:46:25] <wikibugs>	 (PS6) Krinkle: mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: Dzahn)
[23:48:54] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2164.mgmt.codfw.wmnet with reboot policy FORCED
[23:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:49] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2164.mgmt.codfw.wmnet with reboot policy FORCED
[23:54:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:12] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:57:43] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2164.mgmt.codfw.wmnet with reboot policy FORCED
[23:57:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:58] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down