[00:33:52] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[00:42:08] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:42:56] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:38] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 0.83 ms
[01:30:22] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:39:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:56] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:43:25] <wikibugs>	 (CR) Andrew Bogott: [C: +1] openstack.galera: add nodecheck logrotate config [puppet] - https://gerrit.wikimedia.org/r/809100 (owner: David Caro)
[01:46:24] <wikibugs>	 (CR) Andrew Bogott: [C: +1] openstack: update control nodes after refresh [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/822735 (owner: David Caro)
[01:49:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32382 and previous config saved to /var/cache/conftool/dbconfig/20220815-015020-ladsgroup.json
[01:50:25] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[01:52:27] <wikibugs>	 (CR) Andrew Bogott: [C: +1] wmcs.neutron: Add alert to open a task when agent down [alerts] - https://gerrit.wikimedia.org/r/822319 (owner: David Caro)
[01:55:13] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[01:57:14] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:01:08] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms
[02:02:40] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:04:10] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:05:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32383 and previous config saved to /var/cache/conftool/dbconfig/20220815-020526-ladsgroup.json
[02:07:02] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:09:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32384 and previous config saved to /var/cache/conftool/dbconfig/20220815-022032-ladsgroup.json
[02:29:34] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:35:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32385 and previous config saved to /var/cache/conftool/dbconfig/20220815-023538-ladsgroup.json
[02:35:43] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[02:46:14] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:06] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:28] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:16] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:46] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:08] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:31:06] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:29:06] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:31:22] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:55:13] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[06:12:06] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:16:38] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:32] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: MdsShakil)
[06:20:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:22:30] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:08] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:32] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:45:10] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:48:30] <urbanecm>	 jouncebot: next
[06:48:30] <jouncebot>	 In 0 hour(s) and 11 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T0700)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T0700). Please do the needful.
[07:00:05] <jouncebot>	 phuedx, Urbanecm, and MdsShakil: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:23] <urbanecm>	 I can deploy today!
[07:00:28] <Amir1>	 thanks!
[07:01:05] * urbanecm starts with his own patches
[07:01:08] <wikibugs>	 (CR) Urbanecm: [C: +2] Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/822701 (https://phabricator.wikimedia.org/T315141) (owner: Urbanecm)
[07:01:15] <wikibugs>	 (CR) Urbanecm: [C: +2] throttle: Remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/822702 (owner: Urbanecm)
[07:01:18] <wikibugs>	 (CR) Urbanecm: [C: +2] Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/822725 (https://phabricator.wikimedia.org/T315182) (owner: Urbanecm)
[07:02:01] <wikibugs>	 (Merged) jenkins-bot: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/822701 (https://phabricator.wikimedia.org/T315141) (owner: Urbanecm)
[07:02:09] <wikibugs>	 (Merged) jenkins-bot: throttle: Remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/822702 (owner: Urbanecm)
[07:02:12] <wikibugs>	 (Merged) jenkins-bot: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/822725 (https://phabricator.wikimedia.org/T315182) (owner: Urbanecm)
[07:03:48] <urbanecm>	 phuedx: MdsShakil: are you around, please? :)
[07:03:56] <MdsShakil>	 Yah
[07:04:20] <MdsShakil>	 urbanecm: 
[07:04:37] <wikibugs>	 (PS13) Urbanecm: Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: MdsShakil)
[07:04:42] <wikibugs>	 (CR) Urbanecm: [C: +2] Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: MdsShakil)
[07:05:03] <urbanecm>	 hello MdsShakil, thanks. let's do your patch next. Are you familiar with how testing via x-wikimedia-debug works (it's fine if not, i can explain that)?
[07:05:33] <wikibugs>	 (Merged) jenkins-bot: Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: MdsShakil)
[07:06:38] <MdsShakil>	 urbanecm: Previously I worked with mwdebug1001.eqiad, is that the same?
[07:06:43] <urbanecm>	 yup
[07:06:53] <urbanecm>	 mwdebug1001 is one of the debug servers
[07:06:55] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/throttle.php: 7c2a393ee: dc0d62a3: 6f687bcfc: Update throttle rules (T315182, T315141) (duration: 03m 21s)
[07:06:59] <urbanecm>	 I'll let you know once your patch is there
[07:07:03] <stashbot>	 T315182: Request a throttle lift for Wiki-Editathon – 2022-08-23 - https://phabricator.wikimedia.org/T315182
[07:07:03] <stashbot>	 T315141: Request a throttle lift for Festival of media education – 2022-08-16 - 2022-08-18 - https://phabricator.wikimedia.org/T315141
[07:07:19] <urbanecm>	 MdsShakil: your patch is at mwdebug1001. can you test it there?
[07:08:46] <urbanecm>	 !log mwscript resetAuthenticationThrottle.php --wiki=cswiki --signup --ip='194.31.191.20' # T315141
[07:08:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:09:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:09:29] <MdsShakil>	 urbanecm: I think all is OK
[07:09:29] <wikibugs>	 (PS5) Urbanecm: Pin wgCheckUserLogReasonMigrationStage to read and write old [mediawiki-config] - https://gerrit.wikimedia.org/r/820838 (https://phabricator.wikimedia.org/T233004) (owner: Dreamy Jazz)
[07:09:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:09:34] <urbanecm>	 MdsShakil: okay, thanks!
[07:09:35] <urbanecm>	 syncing
[07:09:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[07:09:38] <wikibugs>	 (CR) Urbanecm: [C: +2] Pin wgCheckUserLogReasonMigrationStage to read and write old [mediawiki-config] - https://gerrit.wikimedia.org/r/820838 (https://phabricator.wikimedia.org/T233004) (owner: Dreamy Jazz)
[07:09:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[07:09:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T314041)', diff saved to https://phabricator.wikimedia.org/P32386 and previous config saved to /var/cache/conftool/dbconfig/20220815-070955-ladsgroup.json
[07:09:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:10:00] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[07:10:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:10:31] <wikibugs>	 (Merged) jenkins-bot: Pin wgCheckUserLogReasonMigrationStage to read and write old [mediawiki-config] - https://gerrit.wikimedia.org/r/820838 (https://phabricator.wikimedia.org/T233004) (owner: Dreamy Jazz)
[07:11:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:11:05] <urbanecm>	 phuedx: B&C window is happening, are you around? :)
[07:13:20] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 43cd5ef1bc38bdc8f46f3093cf0baa74cccc9678: Add bnwiki in wgImportSources to bnwikibooks (T314820) (duration: 03m 05s)
[07:13:24] <stashbot>	 T314820: Add bnwiki in wgImportSources to bnwikibooks - https://phabricator.wikimedia.org/T314820
[07:13:25] <urbanecm>	 MdsShakil: your patch should be live!
[07:13:29] <urbanecm>	 anything else? :)
[07:16:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:16:30] <MdsShakil>	 urbanecm: Thank you. That's working fine on-wiki also.
[07:16:30] <MdsShakil>	 https://w.wiki/5aFK
[07:16:40] <urbanecm>	 great :)
[07:17:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:17:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:17:05] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a454d3bc56c344fa62625f7c292ea087bddfebe5: Pin wgCheckUserLogReasonMigrationStage to read and write old (T233004) (duration: 03m 16s)
[07:17:10] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[07:17:34] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:45] <urbanecm>	 !log UTC morning B&C window done
[07:17:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:23:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:23:58] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 1.14 ms
[07:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:27:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:27:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:31:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:33:30] <wikibugs>	 (PS1) Kevin Bazira: ml-services: Add euwiki, huwiki & hywiki drafttopic isvcs to prod [deployment-charts] - https://gerrit.wikimedia.org/r/823109 (https://phabricator.wikimedia.org/T314456)
[07:39:26] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:52:34] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 0.79 ms
[07:52:52] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:02] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:56] <wikibugs>	 (PS1) Tim Starling: Switch www.mediawiki.org to multi-DC mode [puppet] - https://gerrit.wikimedia.org/r/823113
[08:09:09] <wikibugs>	 (PS2) Jbond: Add Cumin aliases for ml-cache [puppet] - https://gerrit.wikimedia.org/r/820129 (owner: Muehlenhoff)
[08:10:40] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:11:29] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM but will leave someone from wmcs to merge" [puppet] - https://gerrit.wikimedia.org/r/821759 (owner: Majavah)
[08:12:28] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/822339 (https://phabricator.wikimedia.org/T311486) (owner: Ayounsi)
[08:17:04] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms
[08:25:44] <wikibugs>	 SRE, Traffic, Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (AndyRussG) Cool, thanks so much for all these explanations!!  >>! In T308799#8115381, @BBlack wrote: > I'm hoping that least some banner outputs will be categorically cac...
[08:28:03] <wikibugs>	 (CR) Jbond: Add names to flow collectors (3 comments) [homer/public] - https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) (owner: Ayounsi)
[08:32:38] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:48] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:34] <wikibugs>	 (PS3) Ladsgroup: Allow DB config reload [mediawiki-config] - https://gerrit.wikimedia.org/r/801721 (https://phabricator.wikimedia.org/T298485) (owner: Daniel Kinzler)
[08:45:38] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 712.00 ms
[08:46:56] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data Engineering Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (EChetty)
[08:50:27] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data Engineering Planning, Shared-Data-Infrastructure: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (EChetty)
[08:50:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data Engineering Planning, Shared-Data-Infrastructure: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (EChetty)
[08:50:56] <wikibugs>	 (PS1) Jelto: install_server: change partman config for gitlab [puppet] - https://gerrit.wikimedia.org/r/823115 (https://phabricator.wikimedia.org/T274463)
[09:01:10] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:02:18] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:06] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[09:13:30] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[09:17:26] <wikibugs>	 (CR) Jbond: "took a quick pass, lgtm but see inline for comments" [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: Cathal Mooney)
[09:17:48] <wikibugs>	 (CR) David Caro: [C: +2] wmcs.neutron: Add alert to open a task when agent down [alerts] - https://gerrit.wikimedia.org/r/822319 (owner: David Caro)
[09:18:00] <wikibugs>	 (CR) David Caro: icinga: fix test with duplicated sample [alerts] - https://gerrit.wikimedia.org/r/822317 (owner: David Caro)
[09:18:03] <wikibugs>	 (CR) David Caro: [C: +2] icinga: fix test with duplicated sample [alerts] - https://gerrit.wikimedia.org/r/822317 (owner: David Caro)
[09:20:31] <wikibugs>	 (Merged) jenkins-bot: icinga: fix test with duplicated sample [alerts] - https://gerrit.wikimedia.org/r/822317 (owner: David Caro)
[09:20:33] <wikibugs>	 (Merged) jenkins-bot: wmcs.neutron: Add alert to open a task when agent down [alerts] - https://gerrit.wikimedia.org/r/822319 (owner: David Caro)
[09:21:10] <wikibugs>	 (CR) Jbond: [C: +1] admin: Move soworu-01 from ldap-only to analytics [puppet] - https://gerrit.wikimedia.org/r/822680 (https://phabricator.wikimedia.org/T313213) (owner: BCornwall)
[09:27:44] <wikibugs>	 (CR) Clément Goubert: [C: +1] "LGTM code-wise. Small stylistic nits on trailing commas for consistency's sake, up to you." [puppet] - https://gerrit.wikimedia.org/r/821255 (https://phabricator.wikimedia.org/T314840) (owner: Giuseppe Lavagetto)
[09:55:13] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[10:00:06] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[10:01:43] <wikibugs>	 SRE-swift-storage, ops-codfw: Failed disk in ms-be2028 - https://phabricator.wikimedia.org/T315213 (MatthewVernon)
[10:03:45] <Emperor>	 !log pd 1I:1:1 modify disablepd forced on ms-be2028 T315213
[10:03:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:49] <stashbot>	 T315213: Failed disk in ms-be2028 - https://phabricator.wikimedia.org/T315213
[10:18:05] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (MatthewVernon) @Papaul could you take another look at this, please, and see if we can get the replacement disk to be visible to the RAID controller?  Also, I don't know if it's possible that `/d...
[10:20:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:21:35] <phuedx>	 urbanecm: Doh. Sorry I missed the ping. I've rescheduled the patch for later today
[10:22:14] <urbanecm>	 phuedx: no worries, happens from time to time :)
[10:26:06] <wikibugs>	 (PS3) Hnowlan: Create basic haproxy container [docker-images/production-images] - https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196)
[10:26:20] <wikibugs>	 (CR) Hnowlan: Create basic haproxy container (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[10:27:14] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:28:39] <wikibugs>	 (CR) MVernon: [C: +2] swift: move swift ring manager repo [puppet] - https://gerrit.wikimedia.org/r/822659 (owner: MVernon)
[10:34:28] <icinga-wm>	 PROBLEM - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:34:30] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T315216 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:34:35] <wikibugs>	 SRE, ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T315216 (ops-monitoring-bot)
[10:40:12] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[10:44:19] <wikibugs>	 SRE, ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T315216 (MatthewVernon)
[10:44:36] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw: Failed disk in ms-be2028 - https://phabricator.wikimedia.org/T315213 (MatthewVernon)
[10:50:13] <wikibugs>	 (PS1) Clément Goubert: scripts/run_ci_locally.sh: Fix arm Mac docker platform warning [puppet] - https://gerrit.wikimedia.org/r/823122
[10:58:13] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:06:03] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:12:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudcephosd10[25-34] Missing/unplugged hard drives - https://phabricator.wikimedia.org/T315221 (dcaro)
[11:13:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudcephosd10[25-34] Missing/unplugged hard drives - https://phabricator.wikimedia.org/T315221 (dcaro)
[11:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:29:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudcephosd10[25-34] Missing/unplugged hard drives - https://phabricator.wikimedia.org/T315221 (Peachey88)
[11:30:53] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[11:35:17] <wikibugs>	 (PS1) Stang: testwikidatawiki: Add wikidata as import source [mediawiki-config] - https://gerrit.wikimedia.org/r/823141 (https://phabricator.wikimedia.org/T315211)
[11:59:45] <wikibugs>	 (PS1) Hnowlan: thumbor: new service chart [deployment-charts] - https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196)
[12:01:52] <wikibugs>	 (CR) CI reject: [V: -1] thumbor: new service chart [deployment-charts] - https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[12:03:03] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:45] <wikibugs>	 (PS2) Hnowlan: thumbor: new service chart [deployment-charts] - https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196)
[12:12:27] <wikibugs>	 (PS1) Stang: jawiki: Restrict abusefilter log view to "abusefilter-modify" user [mediawiki-config] - https://gerrit.wikimedia.org/r/823148 (https://phabricator.wikimedia.org/T315199)
[12:21:01] <wikibugs>	 (CR) Btullis: [C: +2] Add roles and cumin aliases for the new dse_k8s cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: Btullis)
[12:21:15] <wikibugs>	 (PS5) Majavah: P:toolforge: cleanup bastion grid integration [puppet] - https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665)
[12:23:53] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[12:45:31] <wikibugs>	 (CR) Ottomata: [C: +1] "Nice yeah, this was used for server side EventLogging extension to send events.  Pretty sure we've migrated all server side usages. Thanks" [mediawiki-config] - https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230) (owner: Krinkle)
[12:46:24] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (Ottomata) Approved! Thank you!
[12:46:49] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (Ottomata) Approved!
[12:49:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Ottomata) Adding Event Platform tag, we decided to get this hardware to hopefully better support multi DC event stream processing.
[12:53:52] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T1300).
[13:00:05] <jouncebot>	 phuedx and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <phuedx>	 o/
[13:00:19] <koi>	 o/
[13:00:44] <urbanecm>	 I can deploy today
[13:00:59] <wikibugs>	 (03PS3) 10Urbanecm: Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822395 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx)
[13:01:04] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822395 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx)
[13:02:02] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822395 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx)
[13:03:15] <urbanecm>	 phuedx: your patch is at mwdebug1001, can you check please?
[13:03:18] <phuedx>	 On it
[13:04:12] <wikibugs>	 (03PS2) 10Urbanecm: testwikidatawiki: Add wikidata as import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823141 (https://phabricator.wikimedia.org/T315211) (owner: 10Stang)
[13:04:17] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] testwikidatawiki: Add wikidata as import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823141 (https://phabricator.wikimedia.org/T315211) (owner: 10Stang)
[13:04:31] <phuedx>	 urbanecm: I've confirmed that the fully-qualified schema name is still being sent to wikidatawiki but not to, say, enwiki
[13:04:32] <phuedx>	 LGTM
[13:04:49] <urbanecm>	 okay, syncing!
[13:05:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:05:17] <wikibugs>	 (03Merged) 10jenkins-bot: testwikidatawiki: Add wikidata as import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823141 (https://phabricator.wikimedia.org/T315211) (owner: 10Stang)
[13:06:29] <wikibugs>	 (03PS6) 10Jelto: Enable gitlab backup type for wmfbackups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo)
[13:07:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable gitlab backup type for wmfbackups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo)
[13:08:11] <icinga-wm>	 PROBLEM - MegaRAID on db2110 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:08:13] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db2110 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T315229 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:08:17] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10ops-monitoring-bot)
[13:08:19] <wikibugs>	 (03PS7) 10Jelto: Enable gitlab backup type for wmfbackups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo)
[13:08:36] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e2772238003b797b1a8b18b4df0aa56f54132727: Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" (T290303) (duration: 03m 29s)
[13:08:39] <stashbot>	 T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303
[13:08:42] <urbanecm>	 phuedx: should be live!
[13:08:56] <phuedx>	 <3 Thanks. I'll check again to be sure
[13:08:57] <urbanecm>	 koi: your first patch is at mwdebug1001, can you check please?
[13:09:01] <koi>	 looking
[13:09:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:09:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:09:48] <phuedx>	 Everything looks as expected :)
[13:09:54] <urbanecm>	 glad to hear that!
[13:10:02] <koi>	 urbanecm: works as expected, LGTM
[13:10:07] <urbanecm>	 thanks, syncing
[13:12:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:12:47] <wikibugs>	 (03CR) 10Jelto: "Thanks for the review! I added a new patchset and commented in-line" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo)
[13:13:04] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] "Hello," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823148 (https://phabricator.wikimedia.org/T315199) (owner: 10Stang)
[13:13:49] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: de81bcb5874aee16b23ffea5a43466572250a6c2: testwikidatawiki: Add wikidata as import source (T315211) (duration: 03m 26s)
[13:13:53] <stashbot>	 T315211: Enable transwiki import from Wikidata to Testwikidata - https://phabricator.wikimedia.org/T315211
[13:14:31] <urbanecm>	 koi: I -2'ed the other change, because I strongly doubt it will have the benefit the ja.wikipedia community expects. I'll explain in more detail on the task itself, the -2 is just there to ensure it's not merged before ja.wiki reaches an informed decision.
[13:14:50] <koi>	 got it, thanks for the explanation
[13:14:56] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Failed disk in ms-be2028 - https://phabricator.wikimedia.org/T315213 (10MatthewVernon)
[13:15:02] <urbanecm>	 to summarize, the issue is that the same information is (and will continue to be) visible via quarry.wmcloud.org and similar
[13:15:46] <urbanecm>	 the other patch is live koi
[13:16:09] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: thanos-be2002 sdj failed - https://phabricator.wikimedia.org/T314913 (10MatthewVernon)
[13:16:22] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon)
[13:16:29] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[13:17:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:17:26] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: thanos-be2002 sdj failed - https://phabricator.wikimedia.org/T314913 (10MatthewVernon) a:03Papaul
[13:19:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:19:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:20:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:20:53] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10MatthewVernon)
[13:21:14] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10MatthewVernon)
[13:23:50] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder)
[13:28:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Andrew) I'm putting these hosts back into 'insetup' pending hdfs packages on bullseye T310643
[13:29:33] <wikibugs>	 (03CR) 10Samtar: logos/manage.py: Use shortened link in user agent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821246 (owner: 10Samtar)
[13:29:34] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[13:29:39] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[13:30:21] <wikibugs>	 (03PS1) 10Andrew Bogott: clouddumps100[12]: move back to 'insetup' [puppet] - 10https://gerrit.wikimedia.org/r/823155 (https://phabricator.wikimedia.org/T302981)
[13:32:39] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] clouddumps100[12]: move back to 'insetup' [puppet] - 10https://gerrit.wikimedia.org/r/823155 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott)
[13:33:07] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Create basic haproxy container (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[13:33:59] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Failed disk in ms-be2028 - https://phabricator.wikimedia.org/T315213 (10Papaul) This server is out of warranty and I don't have any disk onsite for replacement
[13:34:20] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1070.eqiad.wmnet with OS bullseye
[13:34:26] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1070.eqiad.wmnet with OS bullseye
[13:37:15] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) Please power down this server so I can disconnect the battery and connect it back. Note: the server is out of warranty.
[13:37:49] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10Papaul) Please power down this server so I can disconnect the battery and connect it back. Note: the server is out of warranty.
[13:38:35] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:45] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:32] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] openstack: update control nodes after refresh [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822735 (owner: 10David Caro)
[13:46:44] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1070.eqiad.wmnet with reason: host reimage
[13:47:02] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Check for already published images before pushing [debs/calico] - 10https://gerrit.wikimedia.org/r/654637 (owner: 10JMeybohm)
[13:49:24] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1070.eqiad.wmnet with reason: host reimage
[13:52:55] <wikibugs>	 (03PS1) 10JMeybohm: Update to v3.20.6 [debs/calico] (v3.20) - 10https://gerrit.wikimedia.org/r/823159 (https://phabricator.wikimedia.org/T307943)
[13:53:36] <wikibugs>	 (03Merged) 10jenkins-bot: openstack: update control nodes after refresh [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822735 (owner: 10David Caro)
[13:55:13] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[14:03:52] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder)
[14:05:30] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1070.eqiad.wmnet with OS bullseye
[14:05:35] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1070.eqiad.wmnet with OS bullseye completed: - elastic1070 (...
[14:05:42] <wikibugs>	 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops, 10Community-Tech (CommTech-Sprint-31): SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10JMcLeod_WMF)
[14:10:38] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2032.codfw.wmnet with reason: RAID battery failure
[14:10:51] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@a571f9a]: Add blwiki T310874
[14:10:53] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2032.codfw.wmnet with reason: RAID battery failure
[14:10:54] <stashbot>	 T310874: Add blkwiki to RESTBase - https://phabricator.wikimedia.org/T310874
[14:10:57] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d5e8a120-d0d2-4934-9013-0e0723fbb808) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services wit...
[14:11:59] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10MatthewVernon) I've powered this down; if you LMK once it's back up I can then shutdown ms-be2035.
[14:19:09] <wikibugs>	 (03PS2) 10Ssingh: dnsrecursor: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff)
[14:20:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:20:07] <wikibugs>	 (03CR) 10Ssingh: dnsrecursor: Remove support for stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff)
[14:20:56] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36738/console" [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff)
[14:23:36] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[14:23:59] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1068.eqiad.wmnet with OS bullseye
[14:24:16] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1068.eqiad.wmnet with OS bullseye
[14:24:52] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:33] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@a571f9a]: Add blwiki T310874 (duration: 15m 42s)
[14:26:38] <stashbot>	 T310874: Add blkwiki to RESTBase - https://phabricator.wikimedia.org/T310874
[14:34:30] <wikibugs>	 10SRE, 10Data Engineering Planning, 10Event-Platform Value Stream, 10serviceops, 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Ottomata)
[14:34:41] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+2 C: 03+2] Create basic haproxy container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[14:35:48] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::cumin::target: remove port forwarding support [puppet] - 10https://gerrit.wikimedia.org/r/821759 (owner: 10Majavah)
[14:36:23] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1068.eqiad.wmnet with reason: host reimage
[14:38:42] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 (owner: 10David Caro)
[14:39:03] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs.quota_increase: fix not needed parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822736 (owner: 10David Caro)
[14:39:54] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1068.eqiad.wmnet with reason: host reimage
[14:44:15] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:47:05] <wikibugs>	 (03Merged) 10jenkins-bot: wmcs.quota_increase: fix not needed parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822736 (owner: 10David Caro)
[14:51:00] <wikibugs>	 (03Abandoned) 10Jforrester: inEventSample: Avoid invalid character warning from sampling code, hash into hex [extensions/WikiEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821745 (https://phabricator.wikimedia.org/T314896) (owner: 10Jforrester)
[14:55:37] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 16.92 ms
[14:56:25] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[14:59:41] <wikibugs>	 (03PS3) 10Jbond: O:wikidough: drop wikidough abuse nets [puppet] - 10https://gerrit.wikimedia.org/r/822135 (https://phabricator.wikimedia.org/T313845)
[14:59:46] <TheresNoTime>	 urbanecm: question if you're about ref https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/821249, "extension needs to be present in at least two trains to be addable to extension-list" - *why* does it break scap otherwise? (:
[14:59:58] <TheresNoTime>	 (curious more than anything)
[15:01:08] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1068.eqiad.wmnet with OS bullseye
[15:01:15] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1068.eqiad.wmnet with OS bullseye completed: - elastic1070 (...
[15:01:23] <urbanecm>	 TheresNoTime: because scap uses the extension list to build i18n cache in both production and labs
[15:01:31] <urbanecm>	 and in any stage of the train, we can go one train back
[15:01:44] <TheresNoTime>	 ah so by having two, it'll definitely be there?
[15:01:59] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 (owner: 10David Caro)
[15:02:01] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:02:27] <urbanecm>	 tbh, I'm not 100% sure if it needs to be "at least two", or "all wmf branches that are present at deployment host". 
[15:02:40] <wikibugs>	 (03PS2) 10David Caro: pytest: fix warning about marks not defined [alerts] - 10https://gerrit.wikimedia.org/r/822312
[15:03:00] <urbanecm>	 (since train deployment includes purging old wmf branches from deployment host, it doesn't matter much, but still)
[15:06:19] <urbanecm>	 TheresNoTime: scap rebuilds i18n cache for all versions mentioned in wikiversions.json. and because we only rollback one train back, it's essentially "latest two trains"
[15:06:34] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be2032 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T315235 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gat
[15:06:36] <urbanecm>	 17:01 <TheresNoTime> ah so by having two, it'll definitely be there? <== so, yes.
[15:06:38] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T315235 (10ops-monitoring-bot)
[15:06:40] <wikibugs>	 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10Papaul) @herron is it okay for me to power this server down so I can reset the IDRAC and upgrade it?
[15:06:45] <TheresNoTime>	 urbanecm: makes sense, thank you :)
[15:06:50] <urbanecm>	 any time!
[15:07:03] <icinga-wm>	 PROBLEM - Check systemd state on elastic1068 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:51] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:08:31] <wikibugs>	 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10herron) @Papaul yes please proceed
[15:12:45] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T315235 (10Papaul) 05Open→03Declined Duplicate of T314427
[15:12:55] <wikibugs>	 (03PS1) 10David Caro: ceph: rename CephOSDController to CephOSDNodeController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823168
[15:12:57] <wikibugs>	 (03PS1) 10David Caro: global: add inventory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169
[15:16:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10BTullis) @Andrew - I believe that the hadoop-client package and any others on which this work depends have now been packaged and...
[15:16:40] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] logos/manage.py: Use shortened link in user agent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821246 (owner: 10Samtar)
[15:17:25] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno)
[15:21:25] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10Papaul) Disconnecting the battery didn't fix the issue, so either the servers need to be decommissioned or we need to buy a new battery. No need to power down ms-be2035. This server is back online
[15:22:38] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) I need this server depooled so I can shut it down to work on this disk issue.
[15:22:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ceph: rename CephOSDController to CephOSDNodeController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823168 (owner: 10David Caro)
[15:24:43] <icinga-wm>	 RECOVERY - Check systemd state on elastic1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:25:49] <wikibugs>	 (03PS1) 10Ahmon Dancy: Add git-review package to profile::mediawiki::deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/823173
[15:25:57] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms
[15:26:00] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] maintain-views: Add pagetriage-copyvio to allowed logtypes [puppet] - 10https://gerrit.wikimedia.org/r/815215 (https://phabricator.wikimedia.org/T313281) (owner: 10Zabe)
[15:26:20] <wikibugs>	 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul)
[15:26:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10Papaul) 05Open→03Resolved Complete
[15:26:56] <wikibugs>	 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul)
[15:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:27:52] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10MPhamWMF)
[15:28:35] <wikibugs>	 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul)
[15:29:00] <wikibugs>	 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) 05Open→03Resolved This is complete. Closing it; I will open another task for A1 and A8 once I receive the PDUs
[15:29:11] <wikibugs>	 (03PS2) 10Ahmon Dancy: Add git-review package to profile::mediawiki::deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/823173
[15:30:04] <jouncebot>	 jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T1530).
[15:31:05] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2032.codfw.wmnet
[15:31:05] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2032.codfw.wmnet
[15:31:30] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2032.codfw.wmnet with reason: RAID battery failure
[15:31:44] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2032.codfw.wmnet with reason: RAID battery failure
[15:31:49] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2032.codfw.wmnet
[15:31:50] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2032.codfw.wmnet
[15:31:50] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0b0cee87-305c-4cd2-acf0-ac3d3f5b8587) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services wit...
[15:32:10] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2067.codfw.wmnet with reason: disk fault investigation
[15:32:24] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2067.codfw.wmnet with reason: disk fault investigation
[15:32:28] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=51e5c728-88e2-4e83-acf6-7e651f6e7d29) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services wit...
[15:33:31] <icinga-wm>	 RECOVERY - Host logstash2003 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms
[15:34:19] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) @Papaul I've shut ms-be2067 down for you to work on it. [ignore the downtime on ms-be2032 here, that was a typo]
[15:35:37] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10MatthewVernon) Ah, OK. These hosts are scheduled for decom (but the cluster needs to be healthy enough for the rebalancing necessary for that to proceed).
[15:35:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts logstash2003.codfw.wmnet
[15:36:16] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] admin: Move soworu-01 from ldap-only to analytics [puppet] - 10https://gerrit.wikimedia.org/r/822680 (https://phabricator.wikimedia.org/T313213) (owner: 10BCornwall)
[15:36:27] <wikibugs>	 (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/36739/" [puppet] - 10https://gerrit.wikimedia.org/r/823173 (owner: 10Ahmon Dancy)
[15:38:16] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall)
[15:38:29] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall) a:05Ottomata→03BCornwall
[15:39:03] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall) 05In progress→03Resolved Your permissions have been changed and will go into effect after a short period. I'm closing this ticket now but...
[15:39:22] <wikibugs>	 (03PS1) 10MVernon: swift: ms-be2028 /dev/sdg1 has failed [puppet] - 10https://gerrit.wikimedia.org/r/823178 (https://phabricator.wikimedia.org/T315213)
[15:41:11] <wikibugs>	 (03PS2) 10MVernon: swift: ms-be2028 /dev/sdg1 has failed [puppet] - 10https://gerrit.wikimedia.org/r/823178 (https://phabricator.wikimedia.org/T315213)
[15:41:23] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 04-1] "Thanks for the review Dzahn.  I'll make adjustments to address your comments." [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy)
[15:43:43] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts logstash2003.codfw.wmnet
[15:44:02] <wikibugs>	 SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (jhathaway) @bcampbell, @TAndic, and I had a meeting to try and suss out how this was working previously. As far as we were able to ascertain it appea...
[15:45:01] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (Gehel) We can confirm things are working from the Search Platform point of view. No more work for the Search Platform team, so unassigning...
[15:45:07] <wikibugs>	 (CR) Ssingh: [C: +1] "Thanks for all the work on this!" [puppet] - https://gerrit.wikimedia.org/r/822135 (https://phabricator.wikimedia.org/T313845) (owner: Jbond)
[15:45:09] <wikibugs>	 (PS2) BryanDavis: Add missing attrs dependency [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/822734
[15:48:10] <wikibugs>	 SRE, SRE-swift-storage, Data Engineering Planning, Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (MPhamWMF)
[15:48:35] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:26] <wikibugs>	 SRE, ops-codfw, SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (Papaul) Open→Resolved upgrade IDRAC from 3.21.21.21 to 5.10.30.00 @herron server is back up thanks
[15:50:40] <wikibugs>	 SRE, ops-codfw, SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (herron) Looks much better thank you!
[15:53:21] <icinga-wm>	 PROBLEM - Check systemd state on elastic1082 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:58:13] <icinga-wm>	 RECOVERY - Disk space on ms-be2067 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops
[16:01:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:01:37] <wikibugs>	 SRE, ops-codfw, Gerrit, decommission-hardware, and 2 others: decommission gerrit2001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T315040 (Papaul)
[16:02:14] <wikibugs>	 SRE, ops-codfw, Gerrit, decommission-hardware, and 2 others: decommission gerrit2001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T315040 (Papaul) Open→Resolved complete
[16:05:52] <wikibugs>	 (PS1) Jbond: P:redis::slave: pass the password [puppet] - https://gerrit.wikimedia.org/r/823181 (https://phabricator.wikimedia.org/T228266)
[16:05:54] <wikibugs>	 (PS1) Jbond: P:redis::slave: drop use of inline_template [puppet] - https://gerrit.wikimedia.org/r/823182
[16:09:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[16:10:33] <wikibugs>	 SRE, Traffic-Icebox: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (BCornwall) a:BCornwall
[16:14:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:17:01] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.13.0" for 553 hosts
[16:17:23] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.13.0" completed for 553 hosts
[16:18:30] <wikibugs>	 SRE, ops-codfw, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission cloudcontrol2003-dev.wikimedia.org - https://phabricator.wikimedia.org/T315089 (Papaul)
[16:22:07] <wikibugs>	 (PS4) David Caro: wmcs: add ldap getent speed alerts [alerts] - https://gerrit.wikimedia.org/r/813915
[16:23:23] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1082.eqiad.wmnet with OS bullseye
[16:23:31] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1082.eqiad.wmnet with OS bullseye
[16:23:48] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[16:24:36] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: add ldap getent speed alerts [alerts] - https://gerrit.wikimedia.org/r/813915 (owner: David Caro)
[16:24:48] <wikibugs>	 (CR) Dzahn: [C: +2] Add git-review package to profile::mediawiki::deployment::server [puppet] - https://gerrit.wikimedia.org/r/823173 (owner: Ahmon Dancy)
[16:25:37] <logmsgbot>	 !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet
[16:26:53] <wikibugs>	 (CR) Dzahn: [C: +2] "Notice: /Stage[main]/Profile::Mediawiki::Deployment::Server/Package[git-review]/ensure: created" [puppet] - https://gerrit.wikimedia.org/r/823173 (owner: Ahmon Dancy)
[16:27:10] <logmsgbot>	 !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet
[16:28:10] <wikibugs>	 (CR) Ahmon Dancy: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/823173 (owner: Ahmon Dancy)
[16:28:25] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:35:54] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1082.eqiad.wmnet with reason: host reimage
[16:36:06] <icinga-wm>	 PROBLEM - confd service on sretest1001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:37:31] <wikibugs>	 (CR) Andrew Bogott: wmcs: changes to api service to manage toolforge replica.my.cnf (2 comments) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[16:39:32] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1082.eqiad.wmnet with reason: host reimage
[16:53:29] <wikibugs>	 (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (2 comments) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[16:55:17] <icinga-wm>	 RECOVERY - confd service on sretest1001 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:00:35] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1082.eqiad.wmnet with OS bullseye
[17:00:46] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1082.eqiad.wmnet with OS bullseye completed: - elastic1070 (...
[17:01:59] <icinga-wm>	 PROBLEM - Check systemd state on elastic1082 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:07:23] <wikibugs>	 SRE, ops-codfw, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission cloudcontrol2003-dev.wikimedia.org - https://phabricator.wikimedia.org/T315089 (Papaul) Open→Resolved complete
[17:09:23] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[17:09:27] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[17:09:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:10:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:10:51] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[17:10:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:11:05] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[17:11:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:11:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:11:09] <sukhe>	 hmm
[17:11:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:11:43] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[17:12:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:14:03] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[17:15:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:15:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:15:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:15:47] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[17:15:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:15:55] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:15:55] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[17:15:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:17:47] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1052.eqiad.wmnet with OS bullseye
[17:17:54] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1052.eqiad.wmnet with OS bullseye
[17:18:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:18:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:18:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:18:59] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[17:19:21] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[17:20:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:21:15] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[17:21:17] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[17:21:35] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[17:22:08] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@d4137b5]: increase subgraph query SLA and remove same from drop_old_data
[17:22:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:23:07] <icinga-wm>	 RECOVERY - Check systemd state on elastic1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:23:31] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[17:24:26] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@d4137b5]: increase subgraph query SLA and remove same from drop_old_data (duration: 02m 17s)
[17:25:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:25:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:25:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:26:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:27:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:27:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:27:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:27:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:28:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2067.codfw.wmnet
[17:28:15] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ms-be2067.codfw.wmnet
[17:28:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:28:40] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1052.eqiad.wmnet with reason: host reimage
[17:28:55] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2067.codfw.wmnet
[17:29:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:29:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet
[17:30:06] <wikibugs>	 SRE, ops-codfw: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (Papaul) @Marostegui this server is out of warranty and I don't have any 1.9TB SSD disk onsite. Thanks
[17:32:36] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[17:32:48] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1052.eqiad.wmnet with reason: host reimage
[17:41:01] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (Papaul) I requested another disk to be sent to me. The server is back up. Create Dispatch: Success You have successfully submitted request SR148961821.
[17:41:58] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (Papaul) Give it a minute, I am upgrading the BIOS on it
[17:45:52] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[17:47:30] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[17:48:51] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1052.eqiad.wmnet with OS bullseye
[17:48:59] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1052.eqiad.wmnet with OS bullseye completed: - elastic1070 (...
[17:55:13] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[18:01:24] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (wiki_willy) a:Cmjohnson
[18:04:02] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:04:16] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-compact.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:05:32] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:06:08] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:07:21] <herron>	 !log thanos compact process was hung, forced thanos-compact restart on thanos-fe2001
[18:07:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:39] <wikibugs>	 (CR) Andrea Denisse: [C: +2] netmon: Create the OpenSSH directory inside the rancid home directory [puppet] - https://gerrit.wikimedia.org/r/822196 (https://phabricator.wikimedia.org/T314936) (owner: Andrea Denisse)
[18:09:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:09:58] <jinxer-wm>	 (ThanosCompactIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[18:15:08] <wikibugs>	 (CR) Andrea Denisse: "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/822423 (https://phabricator.wikimedia.org/T257861) (owner: Cwhite)
[18:15:22] <wikibugs>	 (CR) Andrea Denisse: [C: +1] tcpircbot: send tcpircbot logs to centralized logging [puppet] - https://gerrit.wikimedia.org/r/822423 (https://phabricator.wikimedia.org/T257861) (owner: Cwhite)
[18:16:51] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1081.eqiad.wmnet with OS bullseye
[18:16:57] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1081.eqiad.wmnet with OS bullseye
[18:19:54] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM! Thank you!!" [puppet] - https://gerrit.wikimedia.org/r/822421 (https://phabricator.wikimedia.org/T257861) (owner: Cwhite)
[18:20:57] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM! Thank you!" [puppet] - https://gerrit.wikimedia.org/r/822450 (https://phabricator.wikimedia.org/T305175) (owner: Cwhite)
[18:22:01] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM! Thank you!" [puppet] - https://gerrit.wikimedia.org/r/822452 (https://phabricator.wikimedia.org/T305175) (owner: Cwhite)
[18:23:06] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM! Thank you!" [puppet] - https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) (owner: Cwhite)
[18:23:24] <wikibugs>	 (PS5) Clare Ming: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295)
[18:24:34] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be2067.codfw.wmnet
[18:28:40] <wikibugs>	 SRE, conftool, Patch-For-Review: Add requestctl support to ferm - https://phabricator.wikimedia.org/T313825 (jhathaway) In my brief testing it appears that ferm is happy to accept bogus IPs and try to load them in with iptables, leaving the box with no rules at all. @jbond how confident are we that e...
[18:28:49] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[18:29:13] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1081.eqiad.wmnet with reason: host reimage
[18:30:04] <wikibugs>	 (PS1) Gergő Tisza: WelcomeSurvey/VariantHooks: Change hook used for redirection [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064)
[18:31:33] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ms-be2067.codfw.wmnet
[18:33:15] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1081.eqiad.wmnet with reason: host reimage
[18:34:25] <wikibugs>	 (PS1) Andrew Bogott: Revert "clouddumps100[12]: move back to 'insetup'" [puppet] - https://gerrit.wikimedia.org/r/823199 (https://phabricator.wikimedia.org/T302981)
[18:35:44] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Revert "clouddumps100[12]: move back to 'insetup'" [puppet] - https://gerrit.wikimedia.org/r/823199 (https://phabricator.wikimedia.org/T302981) (owner: Andrew Bogott)
[18:35:56] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 93 probes of 682 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:38:01] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@230a820]: include additional debugging information in HivePartitionRangeSensor logs
[18:40:10] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@230a820]: include additional debugging information in HivePartitionRangeSensor logs (duration: 02m 08s)
[18:44:52] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[18:46:28] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 86 probes of 682 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:47:57] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. Brian_King cluster reimage ongoing, this is expected https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[18:47:58] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[18:48:18] <wikibugs>	 (PS1) Andrew Bogott: acme_chief: give cloudstore100[12] access to the dumps certs [puppet] - https://gerrit.wikimedia.org/r/823200 (https://phabricator.wikimedia.org/T302981)
[18:48:27] <wikibugs>	 (CR) CI reject: [V: -1] WelcomeSurvey/VariantHooks: Change hook used for redirection [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064) (owner: Gergő Tisza)
[18:49:18] <wikibugs>	 (CR) Andrew Bogott: [C: +2] acme_chief: give cloudstore100[12] access to the dumps certs [puppet] - https://gerrit.wikimedia.org/r/823200 (https://phabricator.wikimedia.org/T302981) (owner: Andrew Bogott)
[18:49:38] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1081.eqiad.wmnet with OS bullseye
[18:49:45] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1081.eqiad.wmnet with OS bullseye completed: - elastic1070 (...
[18:50:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T314041)', diff saved to https://phabricator.wikimedia.org/P32387 and previous config saved to /var/cache/conftool/dbconfig/20220815-185002-ladsgroup.json
[18:50:05] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[18:50:16] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:55:00] <wikibugs>	 (PS1) Andrew Bogott: acme_chief: give clouddumps100[12] access to the dumps certs [puppet] - https://gerrit.wikimedia.org/r/823201 (https://phabricator.wikimedia.org/T302981)
[18:56:18] <wikibugs>	 (CR) Andrew Bogott: [C: +2] acme_chief: give clouddumps100[12] access to the dumps certs [puppet] - https://gerrit.wikimedia.org/r/823201 (https://phabricator.wikimedia.org/T302981) (owner: Andrew Bogott)
[19:03:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:05:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P32388 and previous config saved to /var/cache/conftool/dbconfig/20220815-190508-ladsgroup.json
[19:06:16] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:06:49] <icinga-wm>	 PROBLEM - NFS on clouddumps1001 is CRITICAL: connect to address 208.80.154.142 and port 2049: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore
[19:08:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:12:02] <icinga-wm>	 PROBLEM - NFS on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 2049: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore
[19:14:50] <wikibugs>	 (03CR) 10Gergő Tisza: "recheck: Selenium timeout in Termbox tests, seems unrelated" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064) (owner: 10Gergő Tisza)
[19:18:49] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder)
[19:20:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P32389 and previous config saved to /var/cache/conftool/dbconfig/20220815-192014-ladsgroup.json
[19:25:00] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[19:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:34:47] <wikibugs>	 (03PS1) 10Andrew Bogott: Give clouddumps100[12] access to hdfs and rsync things [puppet] - 10https://gerrit.wikimedia.org/r/823208 (https://phabricator.wikimedia.org/T302981)
[19:35:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T314041)', diff saved to https://phabricator.wikimedia.org/P32390 and previous config saved to /var/cache/conftool/dbconfig/20220815-193520-ladsgroup.json
[19:35:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[19:35:26] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[19:35:32] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:35:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[19:35:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T314041)', diff saved to https://phabricator.wikimedia.org/P32391 and previous config saved to /var/cache/conftool/dbconfig/20220815-193541-ladsgroup.json
[19:36:26] <wikibugs>	 (03PS1) 10Ahmon Dancy: Add placeholder for scap's phabricator API token [labs/private] - 10https://gerrit.wikimedia.org/r/823209 (https://phabricator.wikimedia.org/T315255)
[19:36:38] <wikibugs>	 10SRE, 10Acme-chief, 10Cloud-VPS, 10Traffic-Icebox, 10cloud-services-team (Kanban): acme-chief shouldn't try to perform OCSP stapling of expired certs - https://phabricator.wikimedia.org/T262251 (10BCornwall) a:05Vgutierrez→03BCornwall
[19:38:18] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:38:58] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Give clouddumps100[12] access to hdfs and rsync things [puppet] - 10https://gerrit.wikimedia.org/r/823208 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott)
[19:40:18] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] Add placeholder for scap's phabricator API token [labs/private] - 10https://gerrit.wikimedia.org/r/823209 (https://phabricator.wikimedia.org/T315255) (owner: 10Ahmon Dancy)
[19:40:32] <wikibugs>	 10SRE, 10Acme-chief, 10Cloud-VPS, 10Traffic-Icebox, 10cloud-services-team (Kanban): acme-chief shouldn't try to perform OCSP stapling of expired certs - https://phabricator.wikimedia.org/T262251 (10BCornwall) I believe that https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/820795 will als...
[19:40:32] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.304 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:40:57] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (10nshahquinn-wmf)
[19:41:41] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (10nshahquinn-wmf)
[19:42:54] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (10nshahquinn-wmf) @SWakiyama as Purity's manager, please approve her to access private data.
[19:44:01] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (10nshahquinn-wmf) @odimitrijevic or @Ottomata, can you approve Purity for LDAP-only membership in `analytics-privatedata-users`?
[19:47:28] <wikibugs>	 (03PS1) 10Ahmon Dancy: profile::mediawiki::deployment::server: Populate /etc/scap/phabricator_token [puppet] - 10https://gerrit.wikimedia.org/r/823210 (https://phabricator.wikimedia.org/T315255)
[19:50:23] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: use logstash routing for w3creportingapi stream [puppet] - 10https://gerrit.wikimedia.org/r/822450 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite)
[19:51:30] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:52:20] <icinga-wm>	 RECOVERY - k8s requests count to the API on ml-serve-ctrl2002 is OK: (C)100 ge (W)50 ge 32.44 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[19:57:37] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (10Ottomata) Approved.
[19:58:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] profile::mediawiki::deployment::server: Populate /etc/scap/phabricator_token [puppet] - 10https://gerrit.wikimedia.org/r/823210 (https://phabricator.wikimedia.org/T315255) (owner: 10Ahmon Dancy)
[19:58:18] <icinga-wm>	 PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T2000).
[20:00:04] <jouncebot>	 cjming and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:12] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:00:23] <cjming>	 i'll deploy o/
[20:00:27] <wikibugs>	 (03PS1) 10Ahmon Dancy: profile::mediawiki::deployment::server: Ensure that /etc/scap directory exists [puppet] - 10https://gerrit.wikimedia.org/r/823211
[20:00:36] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295) (owner: 10Clare Ming)
[20:01:15] <cjming>	 tgr: do you usually deploy your own patches? happy to do yours if you're around
[20:01:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::mediawiki::deployment::server: Ensure that /etc/scap directory exists [puppet] - 10https://gerrit.wikimedia.org/r/823211 (owner: 10Ahmon Dancy)
[20:01:41] <wikibugs>	 (03Merged) 10jenkins-bot: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295) (owner: 10Clare Ming)
[20:02:35] <tgr>	 cjming: thanks! usually one person does all patches (it's slightly faster that way)
[20:03:02] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] WelcomeSurvey/VariantHooks: Change hook used for redirection [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064) (owner: 10Gergő Tisza)
[20:04:52] <cjming>	 tgr: sounds good
[20:05:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:05:55] <wikibugs>	 (03PS2) 10Ahmon Dancy: profile::mediawiki::deployment::server: Move /etc/scap/phabricator_token to class scap::master [puppet] - 10https://gerrit.wikimedia.org/r/823211 (https://phabricator.wikimedia.org/T315255)
[20:08:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::mediawiki::deployment::server: Move /etc/scap/phabricator_token to class scap::master [puppet] - 10https://gerrit.wikimedia.org/r/823211 (https://phabricator.wikimedia.org/T315255) (owner: 10Ahmon Dancy)
[20:09:40] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:10:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:10:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:10:12] <wikibugs>	 (03PS3) 10Ahmon Dancy: profile::mediawiki::deployment::server: Move stuff to class scap::master [puppet] - 10https://gerrit.wikimedia.org/r/823211 (https://phabricator.wikimedia.org/T315255)
[20:11:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:11:05] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36740/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/823211 (https://phabricator.wikimedia.org/T315255) (owner: 10Ahmon Dancy)
[20:12:09] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:821310|Enable sticky header edit A/B test for idwiki + viwiki (T312295)]] (duration: 03m 30s)
[20:12:12] <stashbot>	 T312295: Enable sticky header A/B test for idwiki + viwiki - https://phabricator.wikimedia.org/T312295
[20:15:00] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:21:45] <wikibugs>	 (03Merged) 10jenkins-bot: WelcomeSurvey/VariantHooks: Change hook used for redirection [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064) (owner: 10Gergő Tisza)
[20:22:50] <cjming>	 tgr: your patch is up on mwdebug1002 - can you test?
[20:22:55] <wikibugs>	 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) 05Open→03Resolved This is complete
[20:23:50] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder)
[20:26:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:26:29] <tgr>	 cjming: thanks, works
[20:26:36] <cjming>	 great - syncing
[20:28:22] <Tamzin>	 [5ce9b247-7a74-408b-ae14-c047f2e79585] 2022-08-15 20:28:09: Fatal exception of type "TypeError"
[20:28:27] <Tamzin>	 twice, on trying to load a user's contribs
[20:28:31] <Tamzin>	 with one good load in between
[20:28:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:28:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:29:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:30:18] <jinxer-wm>	 (ProbeDown) firing: (6) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:30:43] <sukhe>	 here
[20:31:13] <sukhe>	 I ACKed it for now
[20:31:16] <sukhe>	 but I am still looking
[20:31:28] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments: Backport: [[gerrit:822485|WelcomeSurvey/VariantHooks: Change hook used for redirection (T313064)]] (duration: 04m 37s)
[20:31:31] <stashbot>	 T313064: WelcomeSurvey: Post-login redirect hooks might interfere with central login during signup - https://phabricator.wikimedia.org/T313064
[20:31:39] <cjming>	 tgr: should be live!
[20:31:41] <tgr>	 Tamzin: I don't see anything like that in the logs
[20:31:49] <tgr>	 cjming: thanks!
[20:31:54] <Tamzin>	 was on trying to load https://en.wikipedia.org/wiki/Special:Contributions/Smoking_Ethel
[20:32:02] <jinxer-wm>	 (ProbeDown) firing: (18) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:32:07] <Tamzin>	 a couple other people mentioned slow loads / exceptions to me
[20:32:23] <TheresNoTime>	 cjming: where are we in the deploy?
[20:32:32] <tgr>	 The page loads fine for me.
[20:32:40] <cjming>	 we are done! just about to close the window
[20:32:52] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[20:32:57] <Tamzin>	 it loads a bit slow for me, and then some scripts/gadgets don't load
[20:33:00] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:33:00] <sukhe>	 ok so something is definitely up
[20:33:02] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:33:03] <cjming>	 !log end of UTC late backport window
[20:33:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:30] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
[20:33:32] <tgr>	 Though indeed surprisingly slow for such a small contributions list.
[20:34:55] <TheresNoTime>	 tgr: sukhe https://phabricator.wikimedia.org/T315260
[20:35:12] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[20:35:14] <cjming>	 actually - is it ok if i deploy one more change?
[20:35:18] <jinxer-wm>	 (ProbeDown) resolved: (8) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:35:19] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1080.eqiad.wmnet with OS bullseye
[20:35:20] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:35:21] <sukhe>	 TheresNoTime: thanks, looking
[20:35:23] <zabe>	 The sync itself could explain the fatals. The GrowthExperiments patch was very scap-unfriendly and caused ~50,000 errors and exceptions while it was being synced
[20:35:26] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1080.eqiad.wmnet with OS bullseye
[20:35:34] <TheresNoTime>	 cjming: we just had a deploy for WelcomeSurvey no?
[20:35:45] <cjming>	 TheresNoTime: yes
[20:35:55] <zabe>	 TheresNoTime, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/822485/
[20:36:36] <jinxer-wm>	 (ProbeDown) resolved: (18) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:36:54] <TheresNoTime>	 considering we're getting resolves, maybe it was just the sync zabe ?
[20:36:55] <tgr>	 TheresNoTime: sorry about that. I probably missed an inter-file dependency in the patch.
[20:37:02] <tgr>	 It should be transient though.
[20:37:15] <zabe>	 yes
[20:37:46] <zabe>	 when scapping such changes, the files arrive in a random order, and for a short period some files are the new version while others are still the old one
[20:38:46] <zabe>	 so for that time period it tried passing a SpecialPageFactory while the old version of WelcomeSurveyHooks.php was still in place, which did not accept a SpecialPageFactory, resulting in that fatal
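Editor's sketch of the mid-sync mismatch zabe describes, in Python rather than PHP. All class and function names below are hypothetical stand-ins, not the real GrowthExperiments code; the only thing taken from the log is the shape of the failure: a new caller passes an extra argument to an old constructor that does not accept it, which surfaces as a TypeError until both files are on the same version.

```python
# Hypothetical stand-ins for illustration only; not the real extension code.

class OldWelcomeSurveyHooks:
    """Old version of the hooks file: the constructor takes only a config."""

    def __init__(self, config):
        self.config = config


class NewWelcomeSurveyHooks:
    """New version: the constructor also accepts a special page factory."""

    def __init__(self, config, special_page_factory):
        self.config = config
        self.special_page_factory = special_page_factory


def new_wiring(hooks_class):
    """New version of the wiring file: always passes two arguments."""
    return hooks_class({"enabled": True}, object())


def simulate_request(hooks_class):
    """Run one request against whichever hooks version is on disk."""
    try:
        new_wiring(hooks_class)
        return "ok"
    except TypeError as exc:
        return f"fatal: {type(exc).__name__}"


# Mid-sync: the new wiring has arrived, the hooks file is still the old one.
mid_sync = simulate_request(OldWelcomeSurveyHooks)
# After the sync completes, both files are the new version.
after_sync = simulate_request(NewWelcomeSurveyHooks)
print(mid_sync, "/", after_sync)
```

The failure is transient by construction: once every file is on the same version the signatures agree again, which matches the alerts resolving on their own in the log.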
[20:39:19] <TheresNoTime>	 we love to see it
[20:40:32] <zabe>	 I wonder why scap did not abort the sync with such an error rate
[20:42:00] <icinga-wm>	 RECOVERY - NFS on clouddumps1002 is OK: TCP OK - 0.000 second response time on 208.80.154.71 port 2049 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore
[20:42:35] <icinga-wm>	 RECOVERY - NFS on clouddumps1001 is OK: TCP OK - 0.000 second response time on 208.80.154.142 port 2049 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore
[20:42:52] <zabe>	 or does scap only look at the error rate after the patch was synced to the canaries?
[20:43:44] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:44:20] <cjming>	 TheresNoTime: i need to revert my config patch from earlier - is it ok to do that now?
[20:44:49] <TheresNoTime>	 cjming: looks good to me, but sukhe is the one responding here :)
[20:45:16] <cjming>	 sukhe: ok if i sneak in a quick revert?
[20:45:51] <sukhe>	 it resolved itself so no issues from SRE's end at least. thanks
[20:46:31] <wikibugs>	 (03PS1) 10Clare Ming: Revert "Enable sticky header edit A/B test for idwiki + viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823227
[20:46:32] <tgr>	 zabe: WelcomeSurveyHooks has a SpecialPage_initList handler which is called on every request, so the canaries should have errored out.
[20:47:10] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:48:18] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1080.eqiad.wmnet with reason: host reimage
[20:48:34] <zabe>	 ok
[20:49:32] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Revert "Enable sticky header edit A/B test for idwiki + viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823227 (owner: 10Clare Ming)
[20:50:26] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Enable sticky header edit A/B test for idwiki + viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823227 (owner: 10Clare Ming)
[20:50:58] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1080.eqiad.wmnet with reason: host reimage
[20:53:13] <wikibugs>	 (03PS1) 10Andrew Bogott: clouddumps: don't puppetized '/srv/dumps' mount [puppet] - 10https://gerrit.wikimedia.org/r/823217 (https://phabricator.wikimedia.org/T302981)
[20:54:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:55:18] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:823227|Revert "Enable sticky header edit A/B test for idwiki + viwiki"]] (duration: 03m 15s)
[20:55:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:55:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:56:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:58:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] clouddumps: don't puppetized '/srv/dumps' mount [puppet] - 10https://gerrit.wikimedia.org/r/823217 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott)
[20:59:27] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T2100).
[21:00:39] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[21:02:42] <wikibugs>	 (03PS1) 10Clare Ming: Sticky header AB test bucketing for 2 treatment buckets [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/823228 (https://phabricator.wikimedia.org/T312573)
[21:02:51] <Tamzin>	 should this be fully fixed now, or is there residual stuff? i'm still having script issues. namely NavPopups isn't loading
[21:04:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:04:58] <cjming>	 does anyone mind if i deploy 2 more quick changes? or are you all in the middle of something
[21:06:18] <tgr>	 It should be fully fixed.
[21:07:26] <Tamzin>	 hmm. anyone else here use NavPopups and able to reproduce/not?
[21:07:29] <icinga-wm>	 RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:07:40] <zabe>	 Could it be that scap only checks the canaries after the change has been fully synced there? In that case no errors are expected since there is no signature drift.
[21:07:48] <tgr>	 cjming: we are still in the deploy window, the errors are not happening anymore, should be fine.
[21:07:51] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1080.eqiad.wmnet with OS bullseye
[21:07:58] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1080.eqiad.wmnet with OS bullseye completed: - elastic1070 (...
[21:08:28] <cjming>	 tgr: thanks
[21:09:03] <tgr>	 Tamzin: NavPopups have always been somewhat unreliable for me. Can you expand on "not loading"? What does mw.inspect() say?
[21:09:25] <Tamzin>	 When I hover over a link, it's the same behavior as if I didn't have NavPopups enabled
[21:09:32] <Tamzin>	 And, 1 sec, lemme see
[21:09:45] <tgr>	 zabe: I did get scap failure in a similar situation in the past.
[21:10:14] <wikibugs>	 (03PS1) 10Clare Ming: Revert "Revert "Enable sticky header edit A/B test for idwiki + viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823229
[21:10:32] <tgr>	 So I'm pretty sure it can detect mid-sync issues (or could in the past, at least).
[21:10:47] <Tamzin>	 Never used mw.inspect before. Which report am I looking at?
[21:11:09] <zabe>	 I see, so it definitely does not feel like it should have gone this way
[21:11:21] <tgr>	 Tamzin: type it in the browser console and you should get a list of which JS modules are loaded.
[21:12:40] <wikibugs>	 (03CR) 10Ori: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/822434 (owner: 10Ori)
[21:13:13] <Tamzin>	 uh... `3	'ext.gadget.Navigation_popups'	'165.0 KiB'	168985`
[21:13:46] <wikibugs>	 (03PS3) 10Dzahn: site: add phabricator role to phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597)
[21:14:44] <wikibugs>	 (03PS4) 10Dzahn: site: add phabricator role to phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597)
[21:15:29] <wikibugs>	 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10diego)
[21:16:26] <tgr>	 Tamzin: I tried enabling navpopups on enwiki, seems to work
[21:17:20] <Tamzin>	 hmm. maybe something with my machine
[21:20:07] <tgr>	 you could try adding debug=1 to the URL, clearing localStorage, clearing cookies (although that might log you out)
[21:21:57] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Sticky header AB test bucketing for 2 treatment buckets [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/823228 (https://phabricator.wikimedia.org/T312573) (owner: 10Clare Ming)
[21:29:53] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1083.eqiad.wmnet with OS bullseye
[21:30:00] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1083.eqiad.wmnet with OS bullseye
[21:36:57] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:37:10] <wikibugs>	 (03Merged) 10jenkins-bot: Sticky header AB test bucketing for 2 treatment buckets [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/823228 (https://phabricator.wikimedia.org/T312573) (owner: 10Clare Ming)
[21:38:01] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[21:38:23] <AnaisGueyte>	 Hi all! I'm here for the security deployment of T310763, did I miss it or is it still happening?
[21:38:50] <cjming>	 fyi - i'm still deploying a few changes - should be wrapped up here in a few
[21:39:09] <AnaisGueyte>	 awesome :) thanks
[21:39:09] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10MobileFrontend (Tracking), 10User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (10Jdlrobson)
[21:39:38] <wikibugs>	 (03CR) 10Gergő Tisza: "Caused T315260." [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064) (owner: 10Gergő Tisza)
[21:42:10] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.es6: Backport: [[gerrit:823228|Sticky header AB test bucketing for 2 treatment buckets (T312573)]] (duration: 03m 05s)
[21:42:14] <stashbot>	 T312573: Make sticky header edit button A/B test work for no sticky header control group - https://phabricator.wikimedia.org/T312573
[21:42:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:42:17] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Revert "Revert "Enable sticky header edit A/B test for idwiki + viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823229 (owner: 10Clare Ming)
[21:42:38] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1083.eqiad.wmnet with reason: host reimage
[21:43:03] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Enable sticky header edit A/B test for idwiki + viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823229 (owner: 10Clare Ming)
[21:43:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:43:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:44:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:45:14] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1083.eqiad.wmnet with reason: host reimage
[21:47:01] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:49:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:50:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:50:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:51:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:54:34] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:823229|Revert "Revert "Enable sticky header edit A/B test for idwiki + viwiki""]] (duration: 03m 37s)
[21:54:36] <wikibugs>	 (PS1) Dzahn: phabricator: make it possible to globally enable/disable vcs setup [puppet] - https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597)
[22:00:08] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[22:01:10] <wikibugs>	 (PS2) Dzahn: phabricator: make it possible to globally enable/disable vcs setup [puppet] - https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597)
[22:01:32] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1083.eqiad.wmnet with OS bullseye
[22:01:39] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1083.eqiad.wmnet with OS bullseye completed: - elastic1070 (...
[22:03:40] <wikibugs>	 (CR) Dzahn: [C: +2] "most of the diff is unrelated: https://puppet-compiler.wmflabs.org/pcc-worker1003/36743/" [puppet] - https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[22:05:10] <wikibugs>	 (PS5) Dzahn: site: add phabricator role to phab2002 [puppet] - https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597)
[22:07:01] <wikibugs>	 (CR) Dzahn: "noop confirmed / double checked on phab1001/phab2001 - will make https://gerrit.wikimedia.org/r/c/operations/puppet/+/685136 not as danger" [puppet] - https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[22:07:43] <wikibugs>	 (CR) Dzahn: [C: +2] "noop confirmed / double checked on phab1001/phab2001 - will make https://gerrit.wikimedia.org/r/c/operations/puppet/+/685136 not as danger" [puppet] - https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[22:08:01] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: update production w3creportingapi guard condition [puppet] - https://gerrit.wikimedia.org/r/822452 (https://phabricator.wikimedia.org/T305175) (owner: Cwhite)
[22:09:59] <wikibugs>	 (CR) Dzahn: "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/823220 this now comes without the VCS setup" [puppet] - https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[22:15:44] <AnaisGueyte>	 I'm not sure if T310763 happened?! 😅
[22:23:50] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[22:24:24] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[22:28:48] <zabe>	 AnaisGueyte, AFAICS it did not
[22:29:18] <AnaisGueyte>	 Thank you!
[22:29:51] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36744/phab2002.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[22:30:26] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:33:51] <mutante>	 !log rsyncing /srv/repos and /srv/dumps from phab1001 to phab2002 before applying prod puppet role (T313360)
[22:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:55] <stashbot>	 T313360: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360
[22:36:46] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[22:38:50] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[22:49:35] <wikibugs>	 (CR) Brennen Bearnes: [C: +1] "Belated +1, looks reasonable." [puppet] - https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[22:49:41] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage
[22:52:30] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage
[22:53:37] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[22:59:40] <mutante>	 !log search-loader1001 - killed puppet process that had been running since May 
[22:59:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:07] <icinga-wm>	 PROBLEM - Check systemd state on elastic1057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:00:29] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[23:03:40] <wikibugs>	 (PS1) Clare Ming: Update sticky header config for idwiki, viwiki A/B experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/823268 (https://phabricator.wikimedia.org/T312295)
[23:06:55] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:08:29] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[23:15:59] <icinga-wm>	 PROBLEM - SSH on phab2002 is CRITICAL: connect to address 10.192.32.54 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:16:13] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[23:16:37] <icinga-wm>	 PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: ssh.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:17:43] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:18:16] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: ssh.service daniel_zahn horrible - still wasn't enough to avoid using the vcs class https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:16] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on phab2002 is CRITICAL: connect to address 10.192.32.54 and port 22: Connection refused daniel_zahn horrible - still wasn't enough to avoid using the vcs class https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:18:51] <wikibugs>	 (CR) Dzahn: [C: +2] "this is soooo annoying.. even after the previous change and reading all the compiler output this STILL caused the exact issues I was worri" [puppet] - https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:20:56] <mutante>	 !log phab2002 - manually removing service IP addresses for git-ssh.codfw.wikimedia.org which were added by puppet even after gerrit:823220 (!) T280597
[23:20:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:00] <stashbot>	 T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597
[23:25:38] <wikibugs>	 (PS5) Cwhite: logstash: add target index validation to tests [puppet] - https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090)
[23:26:37] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[23:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:29:16] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: add target index validation to tests [puppet] - https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) (owner: Cwhite)
[23:35:05] <wikibugs>	 (PS1) Dzahn: Revert "site: add phabricator role to phab2002" [puppet] - https://gerrit.wikimedia.org/r/823230
[23:35:24] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "site: add phabricator role to phab2002" [puppet] - https://gerrit.wikimedia.org/r/823230 (owner: Dzahn)
[23:37:46] <wikibugs>	 (CR) Tim Starling: [C: +2] Switch www.mediawiki.org to multi-DC mode [puppet] - https://gerrit.wikimedia.org/r/823113 (owner: Tim Starling)
[23:39:43] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:40:46] <wikibugs>	 (CR) Dzahn: [C: +2] "even after this a new host will STILL assign git-ssh IP addresses (or try to and fail) :( reverting" [puppet] - https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:41:03] <icinga-wm>	 RECOVERY - SSH on phab2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:41:57] <wikibugs>	 (CR) Dzahn: [C: +2] "a setup where applying a puppet role for a service modifies /etc/ssh/sshd_config is just ..argg" [puppet] - https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:45:30] <icinga-wm>	 PROBLEM - PHD should be running on phab2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[23:48:07] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[23:50:07] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:50:47] <mutante>	 ^ phab2002 - removed from icinga
[23:57:20] <wikibugs>	 (PS2) Tim Starling: Discovery: codfw should be pooled for api-ro and appservers-ro [puppet] - https://gerrit.wikimedia.org/r/818652 (https://phabricator.wikimedia.org/T279664)