[00:12:07] <icinga-wm>	 PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-07-19 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:15:35] <icinga-wm>	 PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-07-19 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:18:53] <wikibugs>	 SRE, Phabricator, Sustainability (Incident Followup): Unable to view tasks in read-only mode - https://phabricator.wikimedia.org/T313879 (RLazarus)
[00:22:33] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (RLazarus)
[00:23:06] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (RLazarus)
[00:24:18] <wikibugs>	 (PS3) BCornwall: geodns: Map out African countries by DC latency [dns] - https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472)
[00:24:20] <wikibugs>	 (PS3) BCornwall: geodns: Move eqsin, drmrs and esams around in Asia [dns] - https://gerrit.wikimedia.org/r/816053 (https://phabricator.wikimedia.org/T311472)
[00:27:31] <wikibugs>	 (CR) BCornwall: geodns: Map out African countries by DC latency (2 comments) [dns] - https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) (owner: BCornwall)
[00:30:05] <icinga-wm>	 PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-07-19 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:33:21] <icinga-wm>	 PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-07-19 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:39:15] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:46:06] <wikibugs>	 SRE, SRE-OnFire, Shellbox, serviceops, Sustainability (Incident Followup): Shellbox resource management - https://phabricator.wikimedia.org/T310557 (RLazarus) Open→Resolved a:RLazarus
[00:57:51] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:01:37] <icinga-wm>	 RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-07-26 00:00:01 (3286 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:04:53] <icinga-wm>	 RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-07-26 00:00:02 (3307 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:13:41] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:24:55] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:57] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:03] <icinga-wm>	 RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-07-26 00:00:01 (3307 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:22:41] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:13] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:00:49] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:12:34] <wikibugs>	 SRE, MediaWiki-General, MediaWiki-libs-Metrics, observability, and 4 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (Krinkle) >>! In T240685#8088120, @colewhite wrote: > From @tstarling's comment, I see a few action items: >  # Create a service that can be a drop...
[03:18:09] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:55:52] <wikibugs>	 SRE, SRE-Access-Requests, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Raymond_Ndibe)
[03:55:59] <icinga-wm>	 RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-07-26 00:00:02 (3286 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[04:42:27] <wikibugs>	 (PS1) Krinkle: mediawiki: Extend /portals max-age from 24h to 1 year [puppet] - https://gerrit.wikimedia.org/r/817409 (https://phabricator.wikimedia.org/T313881)
[04:51:35] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul)
[05:10:29] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db2086 [puppet] - https://gerrit.wikimedia.org/r/817616 (https://phabricator.wikimedia.org/T313482)
[05:10:58] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2086.codfw.wmnet
[05:15:18] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[05:19:13] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:19:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2086.codfw.wmnet
[05:19:27] <wikibugs>	 ops-codfw, decommission-hardware, Patch-For-Review: decommission db2086 - https://phabricator.wikimedia.org/T313482 (Marostegui) a:Papaul
[05:19:36] <wikibugs>	 ops-codfw, decommission-hardware, Patch-For-Review: decommission db2086 - https://phabricator.wikimedia.org/T313482 (Marostegui) Papaul all yours
[05:27:17] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db2086 [puppet] - https://gerrit.wikimedia.org/r/817616 (https://phabricator.wikimedia.org/T313482) (owner: Marostegui)
[05:37:17] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "Looks good to me! Thank you!" [puppet] - https://gerrit.wikimedia.org/r/816135 (https://phabricator.wikimedia.org/T312947) (owner: Filippo Giunchedi)
[05:58:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Increase $wgObjectCacheSessionExpiry to 86400 [mediawiki-config] - https://gerrit.wikimedia.org/r/816060 (https://phabricator.wikimedia.org/T313496) (owner: Tim Starling)
[06:32:33] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:36:38] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] conf100[456]: Remove them from client DNS RRs [dns] - https://gerrit.wikimedia.org/r/817260 (https://phabricator.wikimedia.org/T311408) (owner: Alexandros Kosiaris)
[06:36:52] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "This should move mediawiki to using the newer hosts exclusively" [dns] - https://gerrit.wikimedia.org/r/817260 (https://phabricator.wikimedia.org/T311408) (owner: Alexandros Kosiaris)
[06:39:20] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "This will switch over pybal to use the newer hosts. I'll set a cumin based slow restart of pybal in a screen on cumin1001, with say an in" [puppet] - https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408) (owner: Alexandros Kosiaris)
[06:49:43] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:03] <wikibugs>	 (PS2) Alexandros Kosiaris: Switch zookeeper clients to use conf100[789] [puppet] - https://gerrit.wikimedia.org/r/817265 (https://phabricator.wikimedia.org/T311408)
[06:53:10] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +2 C: +2] Switch zookeeper clients to use conf100[789] [puppet] - https://gerrit.wikimedia.org/r/817265 (https://phabricator.wikimedia.org/T311408) (owner: Alexandros Kosiaris)
[06:58:05] <wikibugs>	 (PS1) Giuseppe Lavagetto: parsoid::testing: install php 7.4 [puppet] - https://gerrit.wikimedia.org/r/817699 (https://phabricator.wikimedia.org/T312638)
[06:58:07] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki::webserver: fallback php version for webrequests [puppet] - https://gerrit.wikimedia.org/r/817700 (https://phabricator.wikimedia.org/T271736)
[06:58:09] <wikibugs>	 (PS1) Giuseppe Lavagetto: parsoid::testing: switch to php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/817701 (https://phabricator.wikimedia.org/T312638)
[06:59:23] <wikibugs>	 (CR) CI reject: [V: -1] mediawiki::webserver: fallback php version for webrequests [puppet] - https://gerrit.wikimedia.org/r/817700 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[07:00:04] <wikibugs>	 SRE-OnFire, observability, SRE Observability (FY2022/2023-Q1): Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (SLyngshede-WMF) A slightly weird way of handling the issue automatically could be using Se...
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:35] <icinga-wm>	 PROBLEM - Check systemd state on kafkamon1002 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:01:13] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36422/console" [puppet] - https://gerrit.wikimedia.org/r/817699 (https://phabricator.wikimedia.org/T312638) (owner: Giuseppe Lavagetto)
[07:01:15] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db2161 to s8 codfw master [puppet] - https://gerrit.wikimedia.org/r/817702 (https://phabricator.wikimedia.org/T313798)
[07:03:17] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 14 hosts with reason: codfw s8 master switch
[07:03:28] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: codfw s8 master switch
[07:05:08] <marostegui>	 !log Restart db2161 to change its binlog format
[07:05:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:45] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:07:11] <wikibugs>	 (PS1) Alexandros Kosiaris: etcd-backup: Switch shebang to python3 [puppet] - https://gerrit.wikimedia.org/r/817704
[07:07:21] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1 C: +2] parsoid::testing: install php 7.4 [puppet] - https://gerrit.wikimedia.org/r/817699 (https://phabricator.wikimedia.org/T312638) (owner: Giuseppe Lavagetto)
[07:08:47] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2065 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2161 with weight 0 T313798', diff saved to https://phabricator.wikimedia.org/P31954 and previous config saved to /var/cache/conftool/dbconfig/20220727-070901-marostegui.json
[07:09:06] <stashbot>	 T313798: Switchover s8 codfw master - https://phabricator.wikimedia.org/T313798
[07:10:37] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2065 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:10:41] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] etcd-backup: Switch shebang to python3 [puppet] - https://gerrit.wikimedia.org/r/817704 (owner: Alexandros Kosiaris)
[07:12:39] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db2161 to s8 codfw master [puppet] - https://gerrit.wikimedia.org/r/817702 (https://phabricator.wikimedia.org/T313798) (owner: Marostegui)
[07:16:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (Volans)
[07:18:01] <wikibugs>	 (PS3) Filippo Giunchedi: prometheus: update blackbox check alerts runbook link [puppet] - https://gerrit.wikimedia.org/r/816135 (https://phabricator.wikimedia.org/T312947)
[07:18:11] <volans>	 !log restarted ferm on ms-be2065 (had failed for a timed out query)
[07:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:31] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (Volans) FYI this is still randomly alerting on IRC: `  icinga-wm| PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBac...
[07:21:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:13] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1065 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:23] <icinga-wm>	 RECOVERY - Check systemd state on conf1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:28:03] <icinga-wm>	 RECOVERY - Check systemd state on conf1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:28:13] <icinga-wm>	 RECOVERY - Check systemd state on conf1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:28:47] <icinga-wm>	 RECOVERY - Check unit status of etcd-backup on conf1008 is OK: OK: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:29:03] <icinga-wm>	 RECOVERY - Check unit status of etcd-backup on conf1007 is OK: OK: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:30:23] <volans>	 !log restarted ferm on ms-be1065 (had failed for a timed out query)
[07:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2161 to s8 codfw primary T313798', diff saved to https://phabricator.wikimedia.org/P31955 and previous config saved to /var/cache/conftool/dbconfig/20220727-073214-marostegui.json
[07:32:19] <stashbot>	 T313798: Switchover s8 codfw master - https://phabricator.wikimedia.org/T313798
[07:32:59] <icinga-wm>	 RECOVERY - Check unit status of etcd-backup on conf1009 is OK: OK: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:33:29] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[07:34:27] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[07:34:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2079 T313798', diff saved to https://phabricator.wikimedia.org/P31956 and previous config saved to /var/cache/conftool/dbconfig/20220727-073442-marostegui.json
[07:34:46] <wikibugs>	 SRE, SRE-Access-Requests, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Volans) p:Triage→Medium
[07:37:18] <wikibugs>	 (PS1) Marostegui: db2165: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/817706
[07:38:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[07:38:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[07:38:46] <wikibugs>	 (CR) Marostegui: [C: +2] db2165: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/817706 (owner: Marostegui)
[07:38:54] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (Volans) p:Triage→Medium
[07:40:55] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2065 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:41:09] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (Volans) @thcipriani: your approval both as group approver and manager is required here ;) I'm preparing the patch in the meantime.
[07:41:29] <wikibugs>	 (PS1) Volans: admin: add dduvall to phabricator-roots [puppet] - https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831)
[07:41:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[07:41:48] <wikibugs>	 (CR) Volans: [C: -1] "Waiting approval on task." [puppet] - https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831) (owner: Volans)
[07:41:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[07:42:26] <wikibugs>	 (PS1) Marostegui: db2079: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/817708 (https://phabricator.wikimedia.org/T313885)
[07:42:29] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:19] <wikibugs>	 (CR) Marostegui: [C: +2] db2079: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/817708 (https://phabricator.wikimedia.org/T313885) (owner: Marostegui)
[07:45:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[07:45:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[07:45:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31957 and previous config saved to /var/cache/conftool/dbconfig/20220727-074546-marostegui.json
[07:45:50] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[07:46:34] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db2172 [puppet] - https://gerrit.wikimedia.org/r/817709 (https://phabricator.wikimedia.org/T311493)
[07:48:35] <wikibugs>	 (PS2) Marostegui: mariadb: Productionize db2172 [puppet] - https://gerrit.wikimedia.org/r/817709 (https://phabricator.wikimedia.org/T311493)
[07:50:05] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db2172 [puppet] - https://gerrit.wikimedia.org/r/817709 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:50:58] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2022-07-27-070728-production [deployment-charts] - https://gerrit.wikimedia.org/r/817089 (https://phabricator.wikimedia.org/T313300)
[07:51:29] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:13] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36423/console" [puppet] - https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: Dduvall)
[07:55:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31958 and previous config saved to /var/cache/conftool/dbconfig/20220727-075523-marostegui.json
[07:55:28] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[07:55:59] <wikibugs>	 (PS1) Marostegui: db2170: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/817710 (https://phabricator.wikimedia.org/T311493)
[07:56:23] <wikibugs>	 (CR) Vgutierrez: [C: +1] "LGTM, Filippo already gave you a +1 so I'm assuming that the intermediate step of setting ensure to absent isn't required" [puppet] - https://gerrit.wikimedia.org/r/814894 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[07:57:20] <wikibugs>	 (CR) Marostegui: [C: +2] db2170: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/817710 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:59:35] <wikibugs>	 SRE, SRE-Access-Requests, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Volans)
[08:00:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2170 (s1, s2) to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P31959 and previous config saved to /var/cache/conftool/dbconfig/20220727-080029-marostegui.json
[08:00:34] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[08:00:35] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:51] <wikibugs>	 SRE, SRE-Access-Requests, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Volans) @Raymond_Ndibe the group `cloud-roots` you're requesting access to does not exist. Did you mean `wmcs-roots`? Please check the groups defined in http...
[08:04:59] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (Volans) @ERayfield some clarification is needed:  > Do you currently have shell access (Yes/No)? > Yes  With which user do you have shell access? I was able to find the `erayfiel...
[08:09:15] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:09:51] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:10:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P31960 and previous config saved to /var/cache/conftool/dbconfig/20220727-081029-marostegui.json
[08:10:33] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48390 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:11:13] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:15:32] <wikibugs>	 (PS1) Marostegui: db2171: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/817711 (https://phabricator.wikimedia.org/T311493)
[08:16:24] <wikibugs>	 (CR) Marostegui: [C: +2] db2171: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/817711 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[08:16:37] <wikibugs>	 (PS1) Volans: Add configurationf for the release script [software/pywmflib] - https://gerrit.wikimedia.org/r/817712
[08:18:48] <wikibugs>	 (PS2) Volans: Add configuration for the release script [software/pywmflib] - https://gerrit.wikimedia.org/r/817712
[08:23:23] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:25:02] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2171 to dbctl [puppet] - https://gerrit.wikimedia.org/r/817715 (https://phabricator.wikimedia.org/T311493)
[08:25:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P31961 and previous config saved to /var/cache/conftool/dbconfig/20220727-082535-marostegui.json
[08:26:13] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db2171 to dbctl [puppet] - https://gerrit.wikimedia.org/r/817715 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[08:26:46] <wikibugs>	 (PS1) Volans: Add configuration for the release script [software/spicerack] - https://gerrit.wikimedia.org/r/817717
[08:28:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2171 (s5, s6) to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P31962 and previous config saved to /var/cache/conftool/dbconfig/20220727-082817-marostegui.json
[08:28:22] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[08:30:47] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to maintenance servers for mfossati - https://phabricator.wikimedia.org/T313706 (mfossati) @Volans : I confirm I can `ssh mwmaint1002.eqiad.wmnet`. @thcipriani : I attended a deployment training session, see {T302204}. I've also scheduled another one: {T313812}...
[08:31:53] <wikibugs>	 (PS1) Volans: Add configuration for the release script [software/homer] - https://gerrit.wikimedia.org/r/817719
[08:33:13] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (Aline_Bruenger_WMDE) Open→Resolved a:Aline_Bruenger_WMDE Thank you very much, @Joe, @Volans and @Dzahn!   Apologies for confusing the LDAP groups and for my late resp...
[08:33:53] <wikibugs>	 (CR) FNegri: wmcs-cinder-backup: fix Retrying() call (1 comment) [puppet] - https://gerrit.wikimedia.org/r/816841 (owner: Andrew Bogott)
[08:40:10] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to maintenance servers for mfossati - https://phabricator.wikimedia.org/T313706 (Volans) @mfossati great! I think we could close this task then and when the time comes open a separate one for `deployment`. Please mention in the future request to convert `restri...
[08:40:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31964 and previous config saved to /var/cache/conftool/dbconfig/20220727-084042-marostegui.json
[08:40:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:40:47] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[08:40:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:40:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:41:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:41:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T312990)', diff saved to https://phabricator.wikimedia.org/P31965 and previous config saved to /var/cache/conftool/dbconfig/20220727-084120-marostegui.json
[08:42:14] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to maintenance servers for mfossati - https://phabricator.wikimedia.org/T313706 (10mfossati) 05Open→03Resolved a:03mfossati That sounds good, closing!
[08:43:15] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Remove insetup role from db2172 [puppet] - 10https://gerrit.wikimedia.org/r/817721 (https://phabricator.wikimedia.org/T311493)
[08:44:52] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup role from db2172 [puppet] - 10https://gerrit.wikimedia.org/r/817721 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[08:46:44] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] "lgtm, thanks a lot! I'm going to test and deploy this on all Runners. Hopefully with one last de-registering and re-registering ;)" [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall)
[08:47:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312990)', diff saved to https://phabricator.wikimedia.org/P31966 and previous config saved to /var/cache/conftool/dbconfig/20220727-084715-marostegui.json
[08:47:20] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[08:47:59] <Amir1>	 jouncebot: nowandnext
[08:47:59] <jouncebot>	 No deployments scheduled for the next 4 hour(s) and 12 minute(s)
[08:47:59] <jouncebot>	 In 4 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T1300)
[08:48:03] <Amir1>	 awesome
[08:48:48] <wikibugs>	 (03PS1) 10Volans: Add configuration for the release script [software/debmonitor] - 10https://gerrit.wikimedia.org/r/817722
[08:48:55] <wikibugs>	 (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817723
[08:49:57] <elukey>	 akosiaris: o/
[08:50:13] <elukey>	 burrow on kafkamon1002 seems broken after the conf hostname changes in hiera
[08:52:17] <akosiaris>	 elukey: o/
[08:52:20] <akosiaris>	 having a look
[08:53:05] <akosiaris>	 elukey: unsurprising. I restarted the wrong unit. fixed
[08:55:05] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:55:19] <elukey>	 akosiaris: it has been a while since I looked into the hosts, so the "burrow" unit should be masked IIRC.. the other more specific units are still failing for a missing pid file though
[08:56:07] <elukey>	 probably /var/run/burrow is missing, puppet doesn't create it
[08:56:57] <elukey>	 yeah
[08:57:25] <elukey>	 !log manually create /var/run/burrow on kafkamon1002 to allow a clean restart of Burrow daemons (after zookeeper config change)
[08:57:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:51] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: set all new swift hosts with 24 HDD [puppet] - 10https://gerrit.wikimedia.org/r/817724 (https://phabricator.wikimedia.org/T294549)
[08:57:52] <elukey>	 !log restart burrow-* on kafkamon1002 to pick up zookeeper changes
[08:57:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:37] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817723 (owner: 10Ladsgroup)
[08:58:51] <icinga-wm>	 RECOVERY - Check systemd state on kafkamon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:27] <wikibugs>	 (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817723 (owner: 10Ladsgroup)
[09:00:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: "See inline, need to add hosts to hieradata/common/profile/swift.yaml too" [puppet] - 10https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549) (owner: 10MVernon)
[09:00:41] <wikibugs>	 (03PS1) 10Volans: Add configuration for the release script [software/cumin] - 10https://gerrit.wikimedia.org/r/817726
[09:01:27] <elukey>	 !log reboot ml-serve2001 - T313822
[09:01:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:37] <stashbot>	 T313822: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822
[09:02:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P31967 and previous config saved to /var/cache/conftool/dbconfig/20220727-090221-marostegui.json
[09:02:31] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "Thanks for this, looks like what we want :)" [puppet] - 10https://gerrit.wikimedia.org/r/817724 (https://phabricator.wikimedia.org/T294549) (owner: 10Filippo Giunchedi)
[09:02:43] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet
[09:02:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: set all new swift hosts with 24 HDD [puppet] - 10https://gerrit.wikimedia.org/r/817724 (https://phabricator.wikimedia.org/T294549) (owner: 10Filippo Giunchedi)
[09:03:18] <wikibugs>	 (03PS2) 10Volans: Add configuration for the release script [software/cumin] - 10https://gerrit.wikimedia.org/r/817726
[09:05:43] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: Fixing favicon of wikiquote and wikibooks, take II (duration: 03m 49s)
[09:06:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:08:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: update blackbox check alerts runbook link [puppet] - 10https://gerrit.wikimedia.org/r/816135 (https://phabricator.wikimedia.org/T312947) (owner: 10Filippo Giunchedi)
[09:09:08] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized portals: Fixing favicon of wikiquote and wikibooks, take II (duration: 03m 24s)
[09:11:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:11:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:11:23] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet
[09:11:36] <wikibugs>	 10SRE, 10ops-codfw, 10Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10elukey) @Papaul host rebooted! It is not running any K8s pods at the moment so if any maintenance is needed, feel free to downtime and go ahead :)  For the ML-Team - the node...
[09:11:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:13:59] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove db2087 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/817728 (https://phabricator.wikimedia.org/T313483)
[09:15:06] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2087 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/817728 (https://phabricator.wikimedia.org/T313483) (owner: 10Marostegui)
[09:17:17] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize db2173 [puppet] - 10https://gerrit.wikimedia.org/r/817729 (https://phabricator.wikimedia.org/T311493)
[09:18:13] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2173 [puppet] - 10https://gerrit.wikimedia.org/r/817729 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[09:21:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2087.codfw.wmnet
[09:22:22] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Decommission db2087 [puppet] - 10https://gerrit.wikimedia.org/r/817731 (https://phabricator.wikimedia.org/T313483)
[09:24:58] <wikibugs>	 (03PS1) 10Elukey: ml-services: move prod docker images to KServe 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/817732 (https://phabricator.wikimedia.org/T311982)
[09:25:36] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2087 [puppet] - 10https://gerrit.wikimedia.org/r/817731 (https://phabricator.wikimedia.org/T313483) (owner: 10Marostegui)
[09:25:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[09:26:54] <wikibugs>	 10SRE, 10Phabricator, 10Sustainability (Incident Followup): Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879 (10Aklapper)
[09:28:33] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:28:45] <marostegui>	 ^ fixing
[09:29:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2087 from dbctl T313483', diff saved to https://phabricator.wikimedia.org/P31968 and previous config saved to /var/cache/conftool/dbconfig/20220727-092917-marostegui.json
[09:29:22] <stashbot>	 T313483: decommission db2087 - https://phabricator.wikimedia.org/T313483
[09:29:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P31969 and previous config saved to /var/cache/conftool/dbconfig/20220727-092924-marostegui.json
[09:29:47] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:29:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2087.codfw.wmnet
[09:30:21] <wikibugs>	 10ops-codfw, 10decommission-hardware: decommission db2087 - https://phabricator.wikimedia.org/T313483 (10Marostegui) a:03Papaul
[09:30:28] <wikibugs>	 10ops-codfw, 10decommission-hardware: decommission db2087 - https://phabricator.wikimedia.org/T313483 (10Marostegui) @Papaul this is ready
[09:31:59] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: Fixing favicon of wikiquote and wikibooks, take III (duration: 03m 19s)
[09:32:19] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ml-serve2001.codfw.wmnet with reason: memtest86+ run
[09:32:33] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ml-serve2001.codfw.wmnet with reason: memtest86+ run
[09:32:39] <wikibugs>	 10SRE, 10ops-codfw, 10Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b087dff3-f32b-4842-9f10-401f09f59c0c) set by klausman@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their ser...
[09:33:34] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:35:36] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized portals: Fixing favicon of wikiquote and wikibooks, take III (duration: 03m 36s)
[09:36:40] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:37:16] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:39:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:43:06] <elukey>	 klausman: I don't recall if the downtime cookbook also downtimes alerts.wikimedia.org too, from --^ it seems that we may need to add more downtime
[09:43:18] <klausman>	 Will do
[09:44:07] <volans>	 elukey: it does for anything with the instance in alertmanager matching what you downtimed
[09:44:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312990)', diff saved to https://phabricator.wikimedia.org/P31970 and previous config saved to /var/cache/conftool/dbconfig/20220727-094430-marostegui.json
[09:44:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[09:44:36] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[09:44:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[09:44:52] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:44:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T312990)', diff saved to https://phabricator.wikimedia.org/P31971 and previous config saved to /var/cache/conftool/dbconfig/20220727-094452-marostegui.json
[09:44:58] <klausman>	 volans: I added a 7d downtime for ml-serve2001.codfw.wmnet, which apparently didn't cover the above alert
[09:46:05] <volans>	 it matches instance=~"^(ml\-serve2001)(:[0-9]+)?$"
[09:46:08] <volans>	 on alertmanager
[09:46:35] <volans>	 cc godog for more expertise :)
[09:46:48] <klausman>	 That regex doesn't include the domain
[09:46:56] <volans>	 silence ID is b087dff3-f32b-4842-9f10-401f09f59c0c
[09:47:13] <volans>	 klausman: yes, because the instance should not have it
[09:47:15] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Promote db2144 to x2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/817736 (https://phabricator.wikimedia.org/T313811)
[09:47:40] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/817736 (https://phabricator.wikimedia.org/T313811) (owner: 10Marostegui)
[09:47:45] <klausman>	 volans: except it does :)
[09:48:11] * volans checking for the related task
[09:48:18] <godog>	 ah yeah, I remember we ran into this issue before
[09:48:40] <godog>	 T304481 cc volans 
[09:48:41] <stashbot>	 T304481: kubernetes / calico alerts have instance with fqdn not hostname - https://phabricator.wikimedia.org/T304481
[09:48:48] <volans>	 yeah just found it
[09:49:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[09:49:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[09:49:39] <klausman>	 Can silences be edited?
[09:50:36] <godog>	 klausman: certainly, you can do it via web from alerts.w.o top right 'bell' icon
[09:50:49] <godog>	 then 'browse', and 'edit' on your silence's entry
[09:51:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312990)', diff saved to https://phabricator.wikimedia.org/P31972 and previous config saved to /var/cache/conftool/dbconfig/20220727-095101-marostegui.json
[09:51:08] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[09:51:53] <godog>	 unfortunately the bandwidth/priority of the task on my end is unchanged at this time :( 
[09:52:35] <godog>	 but yeah re-reading the task I think we should aim at having hostnames and not fqdn, as suggested
[09:52:52] <klausman>	 Ok, edited RE for now
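Note on the silence mismatch discussed above: a minimal sketch, assuming the KubernetesCalicoDown alert carries the instance label `ml-serve2001.codfw.wmnet:9091` (as the alert text and T304481 suggest). It checks the matcher quoted at 09:46:05 against hostname-style and FQDN-style labels. The widened pattern and the `:9100` port are hypothetical illustrations; the log does not show what the silence was actually edited to.

```python
import re

# Matcher quoted in the log at 09:46:05: bare hostname plus optional port.
HOSTNAME_ONLY = re.compile(r"^(ml\-serve2001)(:[0-9]+)?$")

# Hypothetical widened matcher that also accepts an optional domain suffix.
# This is an assumption for illustration, not the regex klausman saved.
HOSTNAME_OR_FQDN = re.compile(r"^(ml\-serve2001)(\.[a-z0-9.\-]+)?(:[0-9]+)?$")


def silenced(pattern, instance):
    """Return True if the silence matcher covers this instance label."""
    return pattern.match(instance) is not None


labels = [
    "ml-serve2001",                   # bare hostname
    "ml-serve2001:9100",              # hostname with a (hypothetical) port
    "ml-serve2001.codfw.wmnet:9091",  # FQDN form seen in the calico alert
]

for label in labels:
    print(label, silenced(HOSTNAME_ONLY, label), silenced(HOSTNAME_OR_FQDN, label))
```

The FQDN label fails the original matcher because `.codfw.wmnet` sits between the hostname and the optional port, which is why the 7-day downtime did not cover the calico alert.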
[09:55:31] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] site.pp: Promote db2144 to x2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/817736 (https://phabricator.wikimedia.org/T313811) (owner: 10Marostegui)
[09:55:40] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::webserver: fallback php version for webrequests [puppet] - 10https://gerrit.wikimedia.org/r/817700 (https://phabricator.wikimedia.org/T271736)
[09:55:42] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: parsoid::testing: switch to php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/817701 (https://phabricator.wikimedia.org/T312638)
[09:56:38] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:57:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36425/console" [puppet] - 10https://gerrit.wikimedia.org/r/817701 (https://phabricator.wikimedia.org/T312638) (owner: 10Giuseppe Lavagetto)
[10:01:42] <wikibugs>	 10SRE, 10ops-codfw, 10Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10klausman) Ok, the machine is booted and sitting in GRUB. @Papaul I can't seem to run memtest86+ via idrac (I just get a black screen). Can you check whether it works with direc...
[10:02:55] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [WIP] Tune wikidata language selector autocomplete (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[10:03:19] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ml-services: Add eu, hu & hy wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/817269 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira)
[10:04:33] <wikibugs>	 (03PS1) 10Filippo Giunchedi: customscripts: export 'mgmt' entries from hiera_export [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266)
[10:05:36] <wikibugs>	 (03CR) 10Jcrespo: "I've realized that prometheus needs access to zarcillo for job generation, I will include it." [puppet] - 10https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo)
[10:06:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P31974 and previous config saved to /var/cache/conftool/dbconfig/20220727-100607-marostegui.json
[10:07:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: "The result is sth like the following:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi)
[10:08:50] <wikibugs>	 10SRE, 10serviceops, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10LSobanski)
[10:09:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] aptrepo: update gitlab-ce & gitlab-runner to 15.0 [puppet] - 10https://gerrit.wikimedia.org/r/817211 (https://phabricator.wikimedia.org/T309062) (owner: 10Jelto)
[10:09:21] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Add eu, hu & hy wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/817269 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira)
[10:10:02] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:11:26] <wikibugs>	 (03PS1) 10Marostegui: install_server: Do not reimage db217[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/817741 (https://phabricator.wikimedia.org/T311493)
[10:12:41] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db217[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/817741 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[10:12:53] <wikibugs>	 (03Abandoned) 10Marostegui: install_server: Do not reimage db217[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/817741 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[10:14:00] <wikibugs>	 (03PS1) 10Marostegui: install_server: Do not reimage db217[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/817742 (https://phabricator.wikimedia.org/T311493)
[10:15:04] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db217[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/817742 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[10:15:49] <wikibugs>	 (03CR) 10Jbond: "I also need to run the vtc tests before merging this" [puppet] - 10https://gerrit.wikimedia.org/r/817299 (owner: 10Jbond)
[10:18:24] <wikibugs>	 (03PS7) 10Jcrespo: db_inventory: Cleanup zarcillo database grants [puppet] - 10https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149)
[10:21:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P31976 and previous config saved to /var/cache/conftool/dbconfig/20220727-102113-marostegui.json
[10:21:21] <wikibugs>	 (03PS1) 10Klausman: ml-k8s: add dummy secrects for articleoutlink [labs/private] - 10https://gerrit.wikimedia.org/r/817744 (https://phabricator.wikimedia.org/T313888)
[10:27:17] <claime>	 /away/go 5
[10:27:29] <claime>	 woops, mb
[10:28:14] <wikibugs>	 (03PS4) 10Jbond: P:base::firewall: Add requestctl definitions to ferm [puppet] - 10https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825)
[10:30:06] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10Volans) @Raymond_Ndibe I've noticed that you currently have 4 different SSH keys in your Wikitech (LDAP) account, and the comments on the keys have different n...
[10:32:49] <wikibugs>	 (03PS1) 10Marostegui: db2072: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817747 (https://phabricator.wikimedia.org/T311493)
[10:33:29] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2072: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817747 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[10:34:16] <wikibugs>	 (03PS3) 10MVernon: swift: drain older systems, bring some new ones online [puppet] - 10https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549)
[10:36:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312990)', diff saved to https://phabricator.wikimedia.org/P31978 and previous config saved to /var/cache/conftool/dbconfig/20220727-103619-marostegui.json
[10:36:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[10:36:25] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[10:36:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[10:36:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T312990)', diff saved to https://phabricator.wikimedia.org/P31979 and previous config saved to /var/cache/conftool/dbconfig/20220727-103640-marostegui.json
[10:37:21] <wikibugs>	 (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549) (owner: 10MVernon)
[10:42:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312990)', diff saved to https://phabricator.wikimedia.org/P31980 and previous config saved to /var/cache/conftool/dbconfig/20220727-104204-marostegui.json
[10:42:11] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[10:42:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549) (owner: 10MVernon)
[10:46:29] <Emperor>	 !log update cassandradev packages for stretch to 3.11.13 T313742
[10:46:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:33] <stashbot>	 T313742: Import Cassandra 3.11.13 as 'dev', Stretch - https://phabricator.wikimedia.org/T313742
[10:47:02] <wikibugs>	 (03CR) 10Volans: "Reviewed current implementation, replied to comment with alternative approach." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi)
[10:49:45] <wikibugs>	 (03PS2) 10Klausman: ml-k8s: add dummy secrects for articleoutlink [labs/private] - 10https://gerrit.wikimedia.org/r/817744 (https://phabricator.wikimedia.org/T313888)
[10:50:43] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] swift: drain older systems, bring some new ones online [puppet] - 10https://gerrit.wikimedia.org/r/817209 (https://phabricator.wikimedia.org/T294549) (owner: 10MVernon)
[10:50:50] <wikibugs>	 (03PS3) 10Klausman: ml-k8s: add dummy secrects for article-outlink [labs/private] - 10https://gerrit.wikimedia.org/r/817744 (https://phabricator.wikimedia.org/T313888)
[10:51:04] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ml-k8s: add dummy secrects for article-outlink [labs/private] - 10https://gerrit.wikimedia.org/r/817744 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman)
[10:51:09] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-k8s: add dummy secrects for article-outlink [labs/private] - 10https://gerrit.wikimedia.org/r/817744 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman)
[10:57:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P31981 and previous config saved to /var/cache/conftool/dbconfig/20220727-105710-marostegui.json
[10:57:32] <wikibugs>	 (03PS1) 10Klausman: ml k8s: Add article-outlink section to deployment server cfg [puppet] - 10https://gerrit.wikimedia.org/r/817749 (https://phabricator.wikimedia.org/T313888)
[11:04:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add configuration for the release script [software/pywmflib] - 10https://gerrit.wikimedia.org/r/817712 (owner: 10Volans)
[11:05:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add configuration for the release script [software/spicerack] - 10https://gerrit.wikimedia.org/r/817717 (owner: 10Volans)
[11:05:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add configuration for the release script [software/homer] - 10https://gerrit.wikimedia.org/r/817719 (owner: 10Volans)
[11:06:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the review!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi)
[11:07:38] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Add configuration for the release script [software/pywmflib] - 10https://gerrit.wikimedia.org/r/817712 (owner: 10Volans)
[11:07:44] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Add configuration for the release script [software/spicerack] - 10https://gerrit.wikimedia.org/r/817717 (owner: 10Volans)
[11:07:52] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Add configuration for the release script [software/homer] - 10https://gerrit.wikimedia.org/r/817719 (owner: 10Volans)
[11:08:00] <wikibugs>	 (03PS1) 10Klausman: ML k8s: fix articletopic-outlink names [labs/private] - 10https://gerrit.wikimedia.org/r/817750 (https://phabricator.wikimedia.org/T313888)
[11:09:45] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] ML k8s: fix articletopic-outlink names [labs/private] - 10https://gerrit.wikimedia.org/r/817750 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman)
[11:11:26] <wikibugs>	 (03Merged) 10jenkins-bot: Add configuration for the release script [software/pywmflib] - 10https://gerrit.wikimedia.org/r/817712 (owner: 10Volans)
[11:12:01] <wikibugs>	 (03Merged) 10jenkins-bot: Add configuration for the release script [software/homer] - 10https://gerrit.wikimedia.org/r/817719 (owner: 10Volans)
[11:12:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[11:12:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P31982 and previous config saved to /var/cache/conftool/dbconfig/20220727-111216-marostegui.json
[11:13:07] <wikibugs>	 (03PS1) 10Klausman: ML k8s: add configuration for articletopic-outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/817751 (https://phabricator.wikimedia.org/T313888)
[11:14:16] <wikibugs>	 (03Merged) 10jenkins-bot: Add configuration for the release script [software/spicerack] - 10https://gerrit.wikimedia.org/r/817717 (owner: 10Volans)
[11:15:02] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:18] <volans>	 Emperor: ^^^
[11:16:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:19:51] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "looks good," [puppet] - 10https://gerrit.wikimedia.org/r/816206 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori)
[11:27:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312990)', diff saved to https://phabricator.wikimedia.org/P31983 and previous config saved to /var/cache/conftool/dbconfig/20220727-112722-marostegui.json
[11:27:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[11:27:32] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[11:27:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[11:28:28] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] "LGTM. Maybe add a short note to the commit msg on why the predictor stanzas are deleted." [deployment-charts] - 10https://gerrit.wikimedia.org/r/817732 (https://phabricator.wikimedia.org/T311982) (owner: 10Elukey)
[11:28:35] <Emperor>	 volans: thanks for the ping; that alert also crops up in #wikimedia-data-persistence 
[11:31:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[11:31:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[11:31:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T312990)', diff saved to https://phabricator.wikimedia.org/P31984 and previous config saved to /var/cache/conftool/dbconfig/20220727-113136-marostegui.json
[11:35:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312990)', diff saved to https://phabricator.wikimedia.org/P31985 and previous config saved to /var/cache/conftool/dbconfig/20220727-113557-marostegui.json
[11:36:03] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[11:37:48] * kart_ updating cxserver..
[11:42:27] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-07-27-070728-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/817089 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry)
[11:45:03] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize db2174 [puppet] - 10https://gerrit.wikimedia.org/r/817755 (https://phabricator.wikimedia.org/T311493)
[11:46:01] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2174 [puppet] - 10https://gerrit.wikimedia.org/r/817755 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[11:46:10] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2022-07-27-070728-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/817089 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry)
[11:46:18] <volans>	 Emperor: ack, sorry then, I was worried it could go unnoticed here with all the rest
[11:46:40] <Emperor>	 NP
[11:47:53] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[11:48:26] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[11:50:58] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:51:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P31986 and previous config saved to /var/cache/conftool/dbconfig/20220727-115103-marostegui.json
[11:51:52] <godog>	  /away
[11:51:56] <godog>	 lolz
[11:53:20] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[11:54:06] <wikibugs>	 (03PS1) 10FNegri: Add fnegri's SSH key to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/817757 (https://phabricator.wikimedia.org/T312597)
[11:54:09] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[11:54:47] <kart_>	 Does Grafana no longer display restarts/deploys? There is a switch, but it doesn't seem to be working.
[11:56:52] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[11:57:41] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[12:00:50] <kart_>	 !log Updated cxserver to 2022-07-27-070728-production (T313300, T309577, T310873, T310880)
[12:00:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:58] <stashbot>	 T309577: Drop support for <figure-inline> and use <span> for inline media - https://phabricator.wikimedia.org/T309577
[12:00:58] <stashbot>	 T310873: Post-creation work for blkwiki - https://phabricator.wikimedia.org/T310873
[12:00:58] <stashbot>	 T313300: Enable Section Translation on 10 more Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T313300
[12:00:58] <stashbot>	 T310880: Post-creation work for pcmwiki - https://phabricator.wikimedia.org/T310880
[12:06:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P31987 and previous config saved to /var/cache/conftool/dbconfig/20220727-120609-marostegui.json
[12:07:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:07:38] <wikibugs>	 (03PS1) 10KartikMistry: Enable SectionTranslation on 10 more WPs where ContentTranslation is available by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817758 (https://phabricator.wikimedia.org/T313300)
[12:10:37] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:14:43] <wikibugs>	 (03CR) 10DCausse: [WIP] Tune wikidata language selector autocomplete (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[12:15:17] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:05] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[12:17:34] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[12:19:21] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: fix re-registration issues [puppet] - 10https://gerrit.wikimedia.org/r/817759 (https://phabricator.wikimedia.org/T311746)
[12:21:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312990)', diff saved to https://phabricator.wikimedia.org/P31988 and previous config saved to /var/cache/conftool/dbconfig/20220727-122115-marostegui.json
[12:21:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[12:21:20] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[12:21:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[12:21:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31989 and previous config saved to /var/cache/conftool/dbconfig/20220727-122147-marostegui.json
[12:23:54] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36426/console" [puppet] - 10https://gerrit.wikimedia.org/r/817759 (https://phabricator.wikimedia.org/T311746) (owner: 10Jelto)
[12:29:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31990 and previous config saved to /var/cache/conftool/dbconfig/20220727-122920-marostegui.json
[12:29:26] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[12:31:07] <wikibugs>	 (03PS2) 10DCausse: Tune the wikidata "language" profile for wbsearchentities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869)
[12:32:13] <wikibugs>	 (03CR) 10DCausse: Tune the wikidata "language" profile for wbsearchentities (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[12:36:53] <wikibugs>	 (03PS1) 10Jaime Nuche: scap: enable target bootstrap in beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/817762 (https://phabricator.wikimedia.org/T303559)
[12:36:57] <wikibugs>	 (03PS5) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221
[12:37:57] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson)
[12:40:26] <wikibugs>	 (03CR) 10Volans: "test nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson)
[12:42:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond)
[12:44:23] <wikibugs>	 (03PS6) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221
[12:44:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P31991 and previous config saved to /var/cache/conftool/dbconfig/20220727-124426-marostegui.json
[12:46:05] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10User-zeljkofilipin: +2 for pwangai - https://phabricator.wikimedia.org/T313794 (10pwangai) @Volans  I need +2 for mediawiki/*
[12:48:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond)
[12:56:02] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36429/console" [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond)
[12:57:13] <wikibugs>	 (03PS7) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221
[12:59:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P31992 and previous config saved to /var/cache/conftool/dbconfig/20220727-125933-marostegui.json
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T1300).
[13:00:04] <jouncebot>	 phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:53] <wikibugs>	 (03PS8) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221
[13:01:34] <Lucas_WMDE>	 I can deploy!
[13:04:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond)
[13:05:40] <wikibugs>	 (03PS9) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221
[13:06:54] <Lucas_WMDE>	 phuedx: the diffConfig output looks like the rate is still 0 on some Beta wikis, is that correct?
[13:06:58] <wikibugs>	 (03PS1) 10Clément Goubert: admin: add cgoubert to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/817767 (https://phabricator.wikimedia.org/T313902)
[13:07:34] <Lucas_WMDE>	 (my browser isn’t happy with the long output, so I piped https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/12559/timestamps/?time=HH:mm:ss&timeZone=GMT+2&appendLog&locale=en_US into less instead)
[13:07:51] <Lucas_WMDE>	 in production testwiki seems to be the only wiki with a 1 rate, which sounds right
[13:10:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond)
[13:10:37] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36431/console" [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond)
[13:10:39] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:47] <Lucas_WMDE>	 phuedx: are you there?
[13:10:56] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] ml k8s: Add article-outlink section to deployment server cfg [puppet] - 10https://gerrit.wikimedia.org/r/817749 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman)
[13:11:52] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ml k8s: Add article-outlink section to deployment server cfg [puppet] - 10https://gerrit.wikimedia.org/r/817749 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman)
[13:12:43] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] ML k8s: add configuration for articletopic-outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/817751 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman)
[13:14:28] <wikibugs>	 (03PS2) 10Elukey: ml-services: move prod docker images to KServe 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/817732 (https://phabricator.wikimedia.org/T311982)
[13:14:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312990)', diff saved to https://phabricator.wikimedia.org/P31993 and previous config saved to /var/cache/conftool/dbconfig/20220727-131439-marostegui.json
[13:14:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[13:14:45] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[13:14:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[13:15:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T312990)', diff saved to https://phabricator.wikimedia.org/P31994 and previous config saved to /var/cache/conftool/dbconfig/20220727-131500-marostegui.json
[13:16:58] <wikibugs>	 (03PS3) 10Elukey: ml-services: move prod docker images to KServe 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/817732 (https://phabricator.wikimedia.org/T311982)
[13:17:01] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ML k8s: add configuration for articletopic-outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/817751 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman)
[13:18:11] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:39] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:20:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312990)', diff saved to https://phabricator.wikimedia.org/P31995 and previous config saved to /var/cache/conftool/dbconfig/20220727-132005-marostegui.json
[13:20:12] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[13:21:11] <icinga-wm>	 RECOVERY - Host ores1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[13:21:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: move prod docker images to KServe 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/817732 (https://phabricator.wikimedia.org/T311982) (owner: 10Elukey)
[13:23:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:25:30] <wikibugs>	 (03PS1) 10Xcollazo: airflow - Modify platform_eng instance to do deployment of airflow-dags [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858)
[13:25:37] <Lucas_WMDE>	 I’ll test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/817317 on mwdebug1001 in the meantime
[13:25:47] <wikibugs>	 (03PS1) 10Klausman: deployment-server: fix ML model name for articletopic-outlink [puppet] - 10https://gerrit.wikimedia.org/r/817775 (https://phabricator.wikimedia.org/T313888)
[13:26:07] <wikibugs>	 (03PS1) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817777 (https://phabricator.wikimedia.org/T313910)
[13:26:28] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] deployment-server: fix ML model name for articletopic-outlink [puppet] - 10https://gerrit.wikimedia.org/r/817775 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman)
[13:27:09] <elukey>	 7
[13:27:39] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[13:28:43] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "I was actually going to ask if this change would make the negative boost for disambiguations etc. ineffective if they occurred together wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[13:30:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[13:31:59] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:32:21] <wikibugs>	 (03CR) 10DCausse: Tune the wikidata "language" profile for wbsearchentities (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[13:32:36] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:32:46] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:32:47] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:34:04] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:34:14] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:34:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.304 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:34:24] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[13:34:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[13:34:28] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:34:56] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Tested it on mwdebug1001, seems to work great. Swiss German no longer beats standard German, but can still readily be found using its labe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[13:35:05] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48391 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:35:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P31997 and previous config saved to /var/cache/conftool/dbconfig/20220727-133511-marostegui.json
[13:35:15] <Lucas_WMDE>	 dcausse: should I deploy the language tuning right now?
[13:35:23] <Lucas_WMDE>	 since we have a deploy window at the moment :)
[13:36:00] <Lucas_WMDE>	 (in the meantime, I’m done testing on mwdebug1001 and wiped my changes using scap pull)
[13:36:22] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:38:58] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: puppet avoid use of reduce functions - https://phabricator.wikimedia.org/T313910 (10Peachey88)
[13:39:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:39:36] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Tune the wikidata "language" profile for wbsearchentities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[13:39:39] <Lucas_WMDE>	 let’s go :)
[13:40:40] <wikibugs>	 (03Merged) 10jenkins-bot: Tune the wikidata "language" profile for wbsearchentities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817317 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[13:41:03] <dcausse>	 Lucas_WMDE: sorry missed your ping, please go ahead :)
[13:41:10] <Lucas_WMDE>	 \o/
[13:41:24] <Lucas_WMDE>	 probably sync IS.php first, then SearchSettingsForWikidata.php
[13:41:40] <dcausse>	 yes
[13:41:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:42:04] <Lucas_WMDE>	 the two parts in SearchSettings look like they need each other but thankfully they’re in the same file
[13:42:07] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] aptrepo: update gitlab-ce & gitlab-runner to 15.0 [puppet] - 10https://gerrit.wikimedia.org/r/817211 (https://phabricator.wikimedia.org/T309062) (owner: 10Jelto)
[13:42:15] <Lucas_WMDE>	 and we’re only making the profile available tomorrow anyways
[13:42:47] <Lucas_WMDE>	 tested on mwdebug1001, still seems to work
[13:43:23] <Lucas_WMDE>	 syncing
[13:44:34] <wikibugs>	 (03PS2) 10Eigyan: [config]: Add click event logging for mobile and desktop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852)
[13:45:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:46:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817317|Tune the wikidata "language" profile for wbsearchentities (T307869)]] (1/2) (duration: 03m 29s)
[13:46:52] <stashbot>	 T307869: Request for new search profile for Wikidata that boosts Items for languages - https://phabricator.wikimedia.org/T307869
[13:46:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:47:10] <wikibugs>	 (03PS2) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817777 (https://phabricator.wikimedia.org/T313910)
[13:49:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:49:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:50:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P31998 and previous config saved to /var/cache/conftool/dbconfig/20220727-135017-marostegui.json
[13:50:21] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikidata.php: Config: [[gerrit:817317|Tune the wikidata "language" profile for wbsearchentities (T307869)]] (2/2) (duration: 03m 21s)
[13:50:44] <wikibugs>	 (03PS1) 10Klausman: ML k8s: move articletopic-outlink files to correct subdir [deployment-charts] - 10https://gerrit.wikimedia.org/r/817781 (https://phabricator.wikimedia.org/T313888)
[13:50:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:51:17] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ML k8s: move articletopic-outlink files to correct subdir [deployment-charts] - 10https://gerrit.wikimedia.org/r/817781 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman)
[13:51:30] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:51:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:36] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36433/console" [puppet] - 10https://gerrit.wikimedia.org/r/817777 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[13:52:18] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Marostegui) Removing the DBA tag as this will only affect A7 and we don't have any DBs there.
[13:52:27] <wikibugs>	 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Marostegui)
[13:54:05] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:56:35] <wikibugs>	 (03Merged) 10jenkins-bot: ML k8s: move articletopic-outlink files to correct subdir [deployment-charts] - 10https://gerrit.wikimedia.org/r/817781 (https://phabricator.wikimedia.org/T313888) (owner: 10Klausman)
[13:59:03] <wikibugs>	 (03PS1) 10Jbond: O:prometheus:  use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817783 (https://phabricator.wikimedia.org/T313910)
[14:01:58] <wikibugs>	 (03PS1) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910)
[14:02:14] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: remove redundant metrics [puppet] - 10https://gerrit.wikimedia.org/r/803297 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[14:02:36] <wikibugs>	 (03PS2) 10MVernon: swift: stop flinging thumbnails at other DC in rewrite.py [puppet] - 10https://gerrit.wikimedia.org/r/816726 (https://phabricator.wikimedia.org/T313102)
[14:03:31] <elukey>	 /7
[14:03:35] <elukey>	 uff sorry :)
[14:04:14] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::cinder: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815681 (owner: 10Majavah)
[14:05:05] <wikibugs>	 (03CR) 10MVernon: swift: stop flinging thumbnails at other DC in rewrite.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816726 (https://phabricator.wikimedia.org/T313102) (owner: 10MVernon)
[14:05:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312990)', diff saved to https://phabricator.wikimedia.org/P31999 and previous config saved to /var/cache/conftool/dbconfig/20220727-140523-marostegui.json
[14:05:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[14:05:29] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[14:05:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[14:05:43] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36434/console" [puppet] - 10https://gerrit.wikimedia.org/r/817783 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[14:05:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T312990)', diff saved to https://phabricator.wikimedia.org/P32000 and previous config saved to /var/cache/conftool/dbconfig/20220727-140544-marostegui.json
[14:06:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36435/console" [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[14:10:23] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312990)', diff saved to https://phabricator.wikimedia.org/P32001 and previous config saved to /var/cache/conftool/dbconfig/20220727-141108-marostegui.json
[14:11:12] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[14:13:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36437/console" [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[14:15:31] <wikibugs>	 (03PS1) 10Clare Ming: Disable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817785 (https://phabricator.wikimedia.org/T312296)
[14:16:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/recommendation-api: apply
[14:16:55] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply
[14:20:33] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "I'd like the two pieces of this decoupled: first let's get the new nodes puppetized, running, and clustered before we actually tell openst" [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah)
[14:20:49] <wikibugs>	 (03PS2) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910)
[14:21:29] <wikibugs>	 (03PS4) 10Ssingh: trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651)
[14:22:13] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36439/console" [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[14:22:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add cgoubert to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/817767 (https://phabricator.wikimedia.org/T313902) (owner: 10Clément Goubert)
[14:22:50] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[14:22:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "Oh, and actually, once the new nodes are ready to receive traffic we can move the service rather than adding the additional nodes... will " [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah)
[14:23:14] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1002.eqiad.wmnet
[14:23:52] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::designate: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815683 (owner: 10Majavah)
[14:26:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32002 and previous config saved to /var/cache/conftool/dbconfig/20220727-142614-marostegui.json
[14:26:36] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36440/console" [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[14:27:30] <wikibugs>	 (03PS3) 10Nskaggs: Expand retry logic for cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640)
[14:27:53] <wikibugs>	 (03PS3) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910)
[14:28:13] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[14:28:30] <wikibugs>	 (03PS2) 10Eevans: Do not assign CASSANDRA_LOG_DIR from environment config [puppet] - 10https://gerrit.wikimedia.org/r/816805 (https://phabricator.wikimedia.org/T309896)
[14:29:10] <wikibugs>	 (03PS1) 10Jbond: do not merge! [puppet] - 10https://gerrit.wikimedia.org/r/817788
[14:30:37] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[14:31:26] <wikibugs>	 10SRE, 10ops-codfw, 10Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10Papaul) The reboot fixed the DIMM error for now:  `   The self-heal operation successfully completed at DIMM DIMM_A2.  Wed 27 Jul 2022 09:06:24  The self-heal operation succes...
[14:32:59] <wikibugs>	 (03CR) 10Andrew Bogott: "This isn't a concern with this patch exactly, but I'm now thinking that existing trove VMs won't know to switch over to the new rabbit ser" [puppet] - 10https://gerrit.wikimedia.org/r/815723 (owner: 10Majavah)
[14:33:07] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:33:11] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:33:39] <wikibugs>	 (03PS4) 10Nskaggs: Expand retry logic for cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640)
[14:34:21] <wikibugs>	 (03PS10) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221
[14:34:23] <wikibugs>	 (03PS1) 10Jbond: P:sretest: import blackbox to sretest to check if its just genrally slow [puppet] - 10https://gerrit.wikimedia.org/r/817789
[14:34:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10Slst2020)
[14:35:07] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36441/console" [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[14:36:38] <wikibugs>	 10SRE, 10ops-codfw: Recycling Pickup for CODFW - https://phabricator.wikimedia.org/T307694 (10Papaul) We received the pre-audit yesterday. The list of servers we sent matches the pre-audit.
[14:38:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221 (owner: 10Jbond)
[14:41:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32003 and previous config saved to /var/cache/conftool/dbconfig/20220727-144120-marostegui.json
[14:41:38] <wikibugs>	 (03PS5) 10Nskaggs: Expand retry logic for cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640)
[14:43:27] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10nskaggs) +1
[14:46:52] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10nskaggs) +1, I approve.  @Volans The contract is intended to be ongoing and does not have a defined end date (granted however, the current contract is only for...
[14:48:12] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10nskaggs)
[14:48:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10nskaggs) @Raymond_Ndibe @Volans I edited the task to reflect the correct LDAP groups.
[14:51:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:51:39] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:53:16] <wikibugs>	 (03CR) 10Nskaggs: Expand retry logic for cinder backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640) (owner: 10Nskaggs)
[14:56:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312990)', diff saved to https://phabricator.wikimedia.org/P32005 and previous config saved to /var/cache/conftool/dbconfig/20220727-145626-marostegui.json
[14:56:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[14:56:31] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[14:56:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[14:56:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T312990)', diff saved to https://phabricator.wikimedia.org/P32006 and previous config saved to /var/cache/conftool/dbconfig/20220727-145646-marostegui.json
[15:02:27] <wikibugs>	 (03PS1) 10AikoChou: ml-services: Add outlink-topic-model isvc to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/817793 (https://phabricator.wikimedia.org/T313888)
[15:04:20] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10Volans) >>! In T313876#8108545, @nskaggs wrote: > +1, I approve. >  > @Volans The contract is intended to be ongoing and does not have a defined end date (gran...
[15:06:26] <wikibugs>	 (03PS22) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249)
[15:10:05] <wikibugs>	 (03PS2) 10Jbond: P:sretest: import blackbox to sretest to check if its just genrally slow [puppet] - 10https://gerrit.wikimedia.org/r/817789
[15:20:32] <wikibugs>	 (03PS8) 10Andrea Denisse: Add role::netmon to the netmon1003 instance. [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074)
[15:21:40] <icinga-wm>	 PROBLEM - Zookeeper Server #page on conf1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[15:22:03] <akosiaris>	 this is me ^
[15:22:08] <akosiaris>	 I'll fix
[15:22:10] <sukhe>	 hello
[15:22:14] <wikibugs>	 (03PS1) 10MVernon: hieradata: move all of sessionstore to 3.11.13 [puppet] - 10https://gerrit.wikimedia.org/r/817798 (https://phabricator.wikimedia.org/T309896)
[15:22:29] <sukhe>	 thanks akosiaris, I have ACKed the page
[15:22:40] <akosiaris>	 it should recover shortly
[15:24:12] <icinga-wm>	 RECOVERY - Zookeeper Server #page on conf1005 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[15:24:37] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10jbond) @Raymond_Ndibe I noticed the following in the above request > Preferred shell username: raymond Are you able to point me to the form, template or link...
[15:25:59] <wikibugs>	 (03PS3) 10Jbond: P:sretest: import blackbox to sretest to check if its just genrally slow [puppet] - 10https://gerrit.wikimedia.org/r/817789
[15:27:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: Add outlink-topic-model isvc to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/817793 (https://phabricator.wikimedia.org/T313888) (owner: 10AikoChou)
[15:28:55] <icinga-wm>	 RECOVERY - Disk space on aqs1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aqs1004&var-datasource=eqiad+prometheus/ops
[15:29:04] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Switch etcd clients to use conf100[789] [puppet] - 10https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris)
[15:31:17] <phuedx>	 Lucas_WMDE: My sincere apologies! Something came up and I had to step away from my computer. I should have updated the deployment calendar as I stepped away but forgot. Again, my apologies. I'll reschedule the deployment to reflect reality
[15:31:47] <Lucas_WMDE>	 phuedx: no problem, it didn’t cost me much ;)
[15:32:00] <Lucas_WMDE>	 maybe see you tomorrow, then ^^
[15:32:17] <phuedx>	 It cost me some karma!
[15:32:23] <Lucas_WMDE>	 oh no D:
[15:32:43] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10thcipriani) >>! In T313831#8107223, @Volans wrote: > @thcipriani: your approval both as group approver and manager is required here ;) > I'm prep...
[15:32:56] <phuedx>	 Sure. I'll schedule it for tomorrow. I'll also look at what's going on with those Beta wikis in the meantime
[15:33:01] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10thcipriani)
[15:33:26] <wikibugs>	 (03CR) 10Volans: "approved on task" [puppet] - 10https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831) (owner: 10Volans)
[15:34:05] <wikibugs>	 (03PS2) 10Volans: admin: add dduvall to phabricator-roots [puppet] - 10https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831)
[15:34:56] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Switch etcd clients to use conf100[789] [puppet] - 10https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris)
[15:35:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831) (owner: 10Volans)
[15:38:41] <wikibugs>	 (03PS11) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817221
[15:38:43] <wikibugs>	 (03PS3) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817777 (https://phabricator.wikimedia.org/T313910)
[15:38:45] <wikibugs>	 (03PS2) 10Jbond: O:prometheus:  use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817783 (https://phabricator.wikimedia.org/T313910)
[15:38:47] <wikibugs>	 (03PS4) 10Jbond: O:prometheus: use map instead of reduce [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910)
[15:39:06] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] Do not assign CASSANDRA_LOG_DIR from environment config [puppet] - 10https://gerrit.wikimedia.org/r/816805 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans)
[15:39:35] <wikibugs>	 (03CR) 10Volans: [C: 03+2] admin: add dduvall to phabricator-roots [puppet] - 10https://gerrit.wikimedia.org/r/817707 (https://phabricator.wikimedia.org/T313831) (owner: 10Volans)
[15:40:08] <volans>	 Emperor: if you got my change too in your puppet-merge feel free to merge it
[15:40:41] <Emperor>	 volans: no, just mine
[15:40:55] <volans>	 ack, doing mine then
[15:40:55] <volans>	 thx
[15:40:56] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: conf100[456]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817806 (https://phabricator.wikimedia.org/T311407)
[15:42:02] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10Volans)
[15:43:19] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10Volans) 05Open→03Resolved a:03Volans @dduvall all set, patch merged in puppet, will be reflected in the fleet within ~30 minutes. I'm resol...
[15:43:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36442/console" [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond)
[15:44:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] conf100[456]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/817806 (https://phabricator.wikimedia.org/T311407) (owner: 10Alexandros Kosiaris)
[15:45:46] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to phab1001/phab2001 for Daniel Duvall (dduvall) - https://phabricator.wikimedia.org/T313831 (10dduvall) Thank you very much!
[15:46:08] <urandom>	 !log restarting Cassandra, sessionstore2001, to restore on-disk logging -- T309896
[15:46:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:14] <stashbot>	 T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896
[15:48:08] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[15:51:38] <urandom>	 !log rolling Cassandra restart, aqs2001-2012, to restore on-disk logging -- T309896
[15:51:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:44] <stashbot>	 T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896
[15:54:11] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[15:54:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T312990)', diff saved to https://phabricator.wikimedia.org/P32007 and previous config saved to /var/cache/conftool/dbconfig/20220727-155417-marostegui.json
[15:54:22] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[15:57:02] <wikibugs>	 (03PS1) 10Dzahn: phabricator: add phabricator-roots on new phabricator hardware [puppet] - 10https://gerrit.wikimedia.org/r/817811 (https://phabricator.wikimedia.org/T280597)
[15:58:18] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/814810 (https://phabricator.wikimedia.org/T313217) (owner: 10Jbond)
[15:58:36] <wikibugs>	 (03CR) 10Dzahn: "well, I will also add the role to the new hosts asap.. just have to double check it doesn't add duplicate IPs or something.. so this is no" [puppet] - 10https://gerrit.wikimedia.org/r/817811 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[16:03:07] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] Icinga: Remove traffic alerts [puppet] - 10https://gerrit.wikimedia.org/r/814894 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall)
[16:03:13] <wikibugs>	 (03PS2) 10BCornwall: Icinga: Remove traffic alerts [puppet] - 10https://gerrit.wikimedia.org/r/814894 (https://phabricator.wikimedia.org/T300723)
[16:05:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] Add fnegri's SSH key to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/817757 (https://phabricator.wikimedia.org/T312597) (owner: 10FNegri)
[16:05:13] <wikibugs>	 10SRE, 10Observability-Alerting, 10Traffic, 10Patch-For-Review, 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10BCornwall) 05In progress→03Resolved
[16:07:05] <wikibugs>	 (03PS2) 10FNegri: Add fnegri's SSH key to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/817757 (https://phabricator.wikimedia.org/T312597)
[16:07:20] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[16:07:49] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1002.eqiad.wmnet
[16:08:02] <wikibugs>	 (03CR) 10FNegri: [V: 03+2] Add fnegri's SSH key to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/817757 (https://phabricator.wikimedia.org/T312597) (owner: 10FNegri)
[16:08:07] <wikibugs>	 (03CR) 10FNegri: [V: 03+2 C: 03+2] Add fnegri's SSH key to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/817757 (https://phabricator.wikimedia.org/T312597) (owner: 10FNegri)
[16:09:21] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:09:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32008 and previous config saved to /var/cache/conftool/dbconfig/20220727-160923-marostegui.json
[16:10:15] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[16:22:00] <wikibugs>	 (03PS4) 10Majavah: P:openstack::trove: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815723
[16:22:02] <wikibugs>	 (03PS2) 10Majavah: hieradata: switch traffic to cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/816818
[16:22:04] <wikibugs>	 (03PS1) 10Majavah: site: install rabbit on cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/817817
[16:22:27] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[16:23:47] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36443/console" [puppet] - 10https://gerrit.wikimedia.org/r/817817 (owner: 10Majavah)
[16:24:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] site: install rabbit on cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/817817 (owner: 10Majavah)
[16:24:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32009 and previous config saved to /var/cache/conftool/dbconfig/20220727-162429-marostegui.json
[16:26:48] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/817817 (owner: 10Majavah)
[16:31:05] <andrewbogott>	 !log this is a sample log, demonstrating to dhinus
[16:31:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:10] <urandom>	 !log rolling Cassandra restart, aqs1010-1015, to restore on-disk logging -- T309896
[16:31:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:13] <stashbot>	 T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896
[16:31:32] <wikibugs>	 (03CR) 10Majavah: hieradata: switch traffic to cloudrabbit1001-3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah)
[16:32:07] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[16:32:08] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1002.eqiad.wmnet
[16:34:24] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[16:34:41] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:39:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T312990)', diff saved to https://phabricator.wikimedia.org/P32010 and previous config saved to /var/cache/conftool/dbconfig/20220727-163935-marostegui.json
[16:39:40] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[16:39:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[16:39:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[16:39:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 13 hosts with reason: Maintenance
[16:40:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 13 hosts with reason: Maintenance
[16:40:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/816726 (https://phabricator.wikimedia.org/T313102) (owner: 10MVernon)
[16:40:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[16:42:24] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1002.eqiad.wmnet
[16:43:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[16:44:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[16:44:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T312990)', diff saved to https://phabricator.wikimedia.org/P32011 and previous config saved to /var/cache/conftool/dbconfig/20220727-164425-marostegui.json
[16:51:31] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] Disable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817785 (https://phabricator.wikimedia.org/T312296) (owner: 10Clare Ming)
[16:52:39] <icinga-wm>	 PROBLEM - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:52:40] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T313952 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:52:47] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T313952 (10ops-monitoring-bot)
[16:59:33] <icinga-wm>	 PROBLEM - Disk space on ms-be2067 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdc1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops
[17:01:01] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:08:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T312990)', diff saved to https://phabricator.wikimedia.org/P32012 and previous config saved to /var/cache/conftool/dbconfig/20220727-170856-marostegui.json
[17:09:02] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[17:10:09] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[17:10:47] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:12:29] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Swift
[17:15:45] <wikibugs>	 (03CR) 10Dzahn: gitlab: add reserved service IP 208.80.154.8, point to replica-new (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[17:16:36] <wikibugs>	 (03CR) 10Dzahn: gitlab: add reserved service IP 208.80.154.8, point to replica-new (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[17:20:39] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:21:16] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@1a72195]: switch image_suggestions_manual from _delta to _full
[17:21:54] <wikibugs>	 (03PS2) 10Dzahn: gitlab: add reserved service IP 208.80.154.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142)
[17:22:34] <wikibugs>	 (03PS3) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142)
[17:23:17] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@1a72195]: switch image_suggestions_manual from _delta to _full (duration: 02m 01s)
[17:23:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[17:24:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P32013 and previous config saved to /var/cache/conftool/dbconfig/20220727-172402-marostegui.json
[17:24:44] <wikibugs>	 (03PS4) 10Dzahn: gitlab: add reserved service IP 208.80.154.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142)
[17:26:43] <wikibugs>	 (03PS5) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142)
[17:26:47] <wikibugs>	 (03CR) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[17:28:24] <wikibugs>	 (03PS1) 10Dzahn: admin: add demon to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/817838
[17:30:37] <wikibugs>	 (03CR) 10Dzahn: "I was first just looking at gerrit-deployers group because it came up in the standup today. Then saw how that includes gerrit-roots and fr" [puppet] - 10https://gerrit.wikimedia.org/r/817838 (owner: 10Dzahn)
[17:31:59] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:32:58] <wikibugs>	 (03PS1) 10RobH: updating sku list for new r440 sku [software] - 10https://gerrit.wikimedia.org/r/817839
[17:33:23] <wikibugs>	 (03CR) 10RobH: [C: 03+2] updating sku list for new r440 sku [software] - 10https://gerrit.wikimedia.org/r/817839 (owner: 10RobH)
[17:34:04] <wikibugs>	 (03Merged) 10jenkins-bot: updating sku list for new r440 sku [software] - 10https://gerrit.wikimedia.org/r/817839 (owner: 10RobH)
[17:39:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P32014 and previous config saved to /var/cache/conftool/dbconfig/20220727-173908-marostegui.json
[17:40:02] <wikibugs>	 (PS2) Jcrespo: Initial commit [software/pampinus] - https://gerrit.wikimedia.org/r/817294 (https://phabricator.wikimedia.org/T283017)
[17:40:43] <wikibugs>	 SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install kafka-logging200[45] - https://phabricator.wikimedia.org/T313959 (RobH)
[17:41:02] <wikibugs>	 SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install kafka-logging200[45] - https://phabricator.wikimedia.org/T313959 (RobH)
[17:44:06] <wikibugs>	 (PS1) Dzahn: gerrit: add gerrit2002 to list of migration dest hosts [puppet] - https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250)
[17:44:33] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:45:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (RobH)
[17:45:22] <wikibugs>	 (CR) Dzahn: "CCers: this is all stuff we did in the past to make migrations easier, yay:" [puppet] - https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[17:45:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (RobH)
[17:46:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (RobH) @herron,  The ordering task lacked racking details, but since we had all the info for the codfw kafka-logging order already, I was able to figure out most of them....
[17:49:33] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:50:49] <mutante>	 Error while evaluating a Resource Statement, Unknown resource type: 'monitoring::alerts::traffic_drop' (file: /etc/puppet/modules/profile/manifests/prometheus/alerts.pp, line: 195, column: 5) on node alert1001.wikimedia.org
[17:54:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T312990)', diff saved to https://phabricator.wikimedia.org/P32015 and previous config saved to /var/cache/conftool/dbconfig/20220727-175414-marostegui.json
[17:54:20] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[17:56:49] <wikibugs>	 (PS1) Volans: admin: add raymond-ndibe user and to WMCS groups [puppet] - https://gerrit.wikimedia.org/r/817843 (https://phabricator.wikimedia.org/T313876)
[17:57:27] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Volans)
[17:57:55] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (Volans) I've uploaded the patch, but will wait on @Raymond_Ndibe replies before proceeding.
[17:59:25] <wikibugs>	 (PS1) Zabe: prometheus: remove traffic_drop alerts [puppet] - https://gerrit.wikimedia.org/r/817844 (https://phabricator.wikimedia.org/T300723)
[17:59:50] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (wiki_willy) RMA shipped out by Chris on Tuesday, July 26  >>! In T312745#8088364, @Cmjohnson wrote: > Replaced the line card, and placed the old one in the same...
[18:00:04] <jouncebot>	 brennen and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T1800).
[18:00:04] <jouncebot>	 brennen and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T1800).
[18:00:45] <wikibugs>	 (CR) BCornwall: [C: +1] prometheus: remove traffic_drop alerts [puppet] - https://gerrit.wikimedia.org/r/817844 (https://phabricator.wikimedia.org/T300723) (owner: Zabe)
[18:00:56] <wikibugs>	 (CR) BCornwall: [C: +2] prometheus: remove traffic_drop alerts [puppet] - https://gerrit.wikimedia.org/r/817844 (https://phabricator.wikimedia.org/T300723) (owner: Zabe)
[18:06:25] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:07:21] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (Volans)
[18:07:28] <brennen>	 o/ - currently blocking on T313836
[18:07:29] <stashbot>	 T313836: MediaWiki\Extension\Translate\TtmServer\ServiceCreationFailure: Unknown type for name 'Apertium': cxserver - https://phabricator.wikimedia.org/T313836
[18:08:24] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (Volans) p:Triage→Medium
[18:08:59] <wikibugs>	 (PS1) Volans: admin: add sstefanova user and to WMCS groups [puppet] - https://gerrit.wikimedia.org/r/817845 (https://phabricator.wikimedia.org/T313934)
[18:09:14] <wikibugs>	 (Abandoned) Majavah: discovery: switchover doc to doc1002 [dns] - https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: Majavah)
[18:10:36] <wikibugs>	 (PS1) BCornwall: Remove more traffic alerts [puppet] - https://gerrit.wikimedia.org/r/817866 (https://phabricator.wikimedia.org/T300723)
[18:15:03] <rzl>	 brennen, jeena: posting a change shortly that'll decom mw2254, currently a scap proxy -- but just for clarity, it won't be going anywhere yet, you have the conch
[18:15:33] <wikibugs>	 (PS1) RLazarus: Decom mw2251-2255,2257,2258 [puppet] - https://gerrit.wikimedia.org/r/817869 (https://phabricator.wikimedia.org/T313730)
[18:16:17] <jeena>	 Thanks rzl
[18:16:55] <wikibugs>	 (CR) RLazarus: "Posting now but will merge later today, after depooling and running decom cookbook" [puppet] - https://gerrit.wikimedia.org/r/817869 (https://phabricator.wikimedia.org/T313730) (owner: RLazarus)
[18:17:25] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations: Migrate get-raid-status-megacli to Python3 - https://phabricator.wikimedia.org/T313952 (Volans) p:Triage→Medium
[18:22:47] <wikibugs>	 (PS1) Jcrespo: dbbackups: Increase es dumps retention from 8 days to 10 days [puppet] - https://gerrit.wikimedia.org/r/817871 (https://phabricator.wikimedia.org/T138562)
[18:22:57] <wikibugs>	 (PS2) Jcrespo: dbbackups: Increase es dumps retention from 8 days to 10 days [puppet] - https://gerrit.wikimedia.org/r/817871 (https://phabricator.wikimedia.org/T138562)
[18:23:21] <wikibugs>	 (PS2) Dzahn: gerrit: turn gerrit2002 into a gerrit migration dest host [puppet] - https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250)
[18:24:07] <wikibugs>	 (CR) Jcrespo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/817871 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[18:25:39] <wikibugs>	 SRE, ops-codfw, DC-Ops, masz, serviceops: Q1:rack/setup/install new memcached hosts - https://phabricator.wikimedia.org/T313963 (RobH)
[18:25:50] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new memcached hosts - https://phabricator.wikimedia.org/T313963 (RobH)
[18:25:59] <wikibugs>	 (CR) Jcrespo: [C: +2] db_inventory: Cleanup zarcillo database grants [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[18:26:35] <wikibugs>	 (CR) Jcrespo: [C: +2] "wrong patch to +2" [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[18:26:47] <wikibugs>	 (CR) Jcrespo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/817181 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[18:27:07] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Increase es dumps retention from 8 days to 10 days [puppet] - https://gerrit.wikimedia.org/r/817871 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[18:28:03] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new memcached hosts - https://phabricator.wikimedia.org/T313963 (RobH) @joe,  Reassigning this to you per our IRC discussion.  Pending needs from you/#serviceops:  * Please populate racking details section with hostname, OS and all other f...
[18:28:41] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new memcached hosts - https://phabricator.wikimedia.org/T313963 (RobH)
[18:28:55] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313963 (RobH)
[18:29:25] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:29:56] <wikibugs>	 SRE, serviceops: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313965 (RobH)
[18:31:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313963 (RobH)
[18:31:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (RobH)
[18:32:10] <wikibugs>	 SRE, serviceops: eqiad (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313965 (RobH)
[18:33:12] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (RobH)
[18:33:39] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (RobH) @joe,  Reassigning this to you per our IRC discussion.  Pending needs from you/#serviceops:  * Please populate racking details section with hostname, OS and all o...
[18:35:07] <wikibugs>	 SRE, serviceops: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313968 (RobH)
[18:35:27] <wikibugs>	 SRE, serviceops: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313968 (RobH)
[18:36:10] <wikibugs>	 (CR) Eevans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/817798 (https://phabricator.wikimedia.org/T309896) (owner: MVernon)
[18:36:33] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (RobH)
[18:36:47] <wikibugs>	 SRE, serviceops: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313968 (RobH)
[18:37:12] <wikibugs>	 (PS9) Andrea Denisse: Add role::netmon to the netmon1003 instance. [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074)
[18:37:44] <wikibugs>	 (CR) Dzahn: [V: -1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/36447/gerrit2002.wikimedia.org/change.gerrit2002.wikimedia.org.err" [puppet] - https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[18:47:18] <wikibugs>	 (CR) Cwhite: [C: +1] "Looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[18:48:40] <wikibugs>	 (CR) Andrea Denisse: [C: +2] Add role::netmon to the netmon1003 instance. [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[18:50:18] <wikibugs>	 (PS1) Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq [dns] - https://gerrit.wikimedia.org/r/817877
[18:50:20] <wikibugs>	 (PS1) Andrew Bogott: wikimediacloud.org: use cloudcontrol1006 as the primary openstack endpoint [dns] - https://gerrit.wikimedia.org/r/817878 (https://phabricator.wikimedia.org/T313268)
[18:51:17] <wikibugs>	 (CR) Andrew Bogott: "Assuming we merge https://gerrit.wikimedia.org/r/c/operations/dns/+/817877, this will need updating to use the service addresses." [puppet] - https://gerrit.wikimedia.org/r/816818 (owner: Majavah)
[18:51:41] <wikibugs>	 (CR) CI reject: [V: -1] wikimediacloud.org: add cname records for rabbitmq [dns] - https://gerrit.wikimedia.org/r/817877 (owner: Andrew Bogott)
[18:51:43] <wikibugs>	 (CR) CI reject: [V: -1] wikimediacloud.org: use cloudcontrol1006 as the primary openstack endpoint [dns] - https://gerrit.wikimedia.org/r/817878 (https://phabricator.wikimedia.org/T313268) (owner: Andrew Bogott)
[18:53:43] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul)
[18:54:39] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:openstack::trove: use new rabbitmq_hosts hiera var [puppet] - https://gerrit.wikimedia.org/r/815723 (owner: Majavah)
[18:56:44] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @ayounsi I moved asw2-c and asw2-d uplink from 0/0 and 0/1 to 1/0 and 1/3 on both router to match codfw. In the future if we have to change row A and ro...
[18:58:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[19:00:59] <wikibugs>	 (PS1) Andrea Denisse: netmon: Add to Acme-chief's Hieradata. [puppet] - https://gerrit.wikimedia.org/r/817883
[19:01:15] <wikibugs>	 (CR) Andrew Bogott: [C: +2] site: install rabbit on cloudrabbit1001-3 [puppet] - https://gerrit.wikimedia.org/r/817817 (owner: Majavah)
[19:01:22] <wikibugs>	 (PS2) Andrew Bogott: site: install rabbit on cloudrabbit1001-3 [puppet] - https://gerrit.wikimedia.org/r/817817 (owner: Majavah)
[19:02:23] <wikibugs>	 (CR) Cwhite: [C: +1] netmon: Add to Acme-chief's Hieradata. [puppet] - https://gerrit.wikimedia.org/r/817883 (owner: Andrea Denisse)
[19:02:45] <wikibugs>	 (CR) Andrea Denisse: [C: +2] netmon: Add to Acme-chief's Hieradata. [puppet] - https://gerrit.wikimedia.org/r/817883 (owner: Andrea Denisse)
[19:09:13] <wikibugs>	 (PS2) Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq [dns] - https://gerrit.wikimedia.org/r/817877
[19:10:29] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T313607 (Jgreen) >>! In T313607#8103921, @Volans wrote: > @Jgreen All this for us is managed by the `sre.hosts.decommission` cookbook, that we can't run for your hosts. > I think you sho...
[19:11:07] <wikibugs>	 (PS1) Andrea Denisse: netmon: Add netmon instances to Acme-chief's Hieradata for Smokeping [puppet] - https://gerrit.wikimedia.org/r/817887 (https://phabricator.wikimedia.org/T309074)
[19:11:27] <wikibugs>	 (CR) Cwhite: [C: +1] netmon: Add netmon instances to Acme-chief's Hieradata for Smokeping [puppet] - https://gerrit.wikimedia.org/r/817887 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[19:11:41] <wikibugs>	 (CR) Andrea Denisse: [C: +2] netmon: Add netmon instances to Acme-chief's Hieradata for Smokeping [puppet] - https://gerrit.wikimedia.org/r/817887 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[19:11:48] <wikibugs>	 (CR) Andrea Denisse: [V: +2 C: +2] netmon: Add netmon instances to Acme-chief's Hieradata for Smokeping [puppet] - https://gerrit.wikimedia.org/r/817887 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[19:13:50] <wikibugs>	 (CR) Majavah: "Thoughts about not using numerical identifiers and going with, say, rabbitmq-a/b/c instead to reduce any possible confusion about the numb" [dns] - https://gerrit.wikimedia.org/r/817877 (owner: Andrew Bogott)
[19:14:24] <wikibugs>	 (PS2) Andrew Bogott: wikimediacloud.org: use cloudcontrol1006 as the primary openstack endpoint [dns] - https://gerrit.wikimedia.org/r/817878 (https://phabricator.wikimedia.org/T313268)
[19:15:27] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:17:58] <wikibugs>	 (CR) Dzahn: [C: +1] "lgtm, the compiler output is a bit special but that is because currently the puppet run on icinga is broken and this change fixes it again" [puppet] - https://gerrit.wikimedia.org/r/817866 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[19:19:07] <wikibugs>	 (PS4) BCornwall: Remove kafka alerting class [puppet] - https://gerrit.wikimedia.org/r/817866 (https://phabricator.wikimedia.org/T300723)
[19:20:57] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:22:17] <wikibugs>	 (CR) BCornwall: [C: +2] Remove kafka alerting class [puppet] - https://gerrit.wikimedia.org/r/817866 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[19:26:56] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review, User-zeljkofilipin: +2 for pwangai - https://phabricator.wikimedia.org/T313794 (pwangai) Open→Resolved a:pwangai I am closing this request to follow instructions stipulated at https://www.mediawiki.org/wiki/Gerrit/Privilege_policy/en#Reques...
[19:26:57] <wikibugs>	 (PS1) Andrea Denisse: netmon: Add netmon1003 to scap's hieradata. [puppet] - https://gerrit.wikimedia.org/r/817890 (https://phabricator.wikimedia.org/T309074)
[19:26:59] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:29:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db119[67] - https://phabricator.wikimedia.org/T313978 (RobH)
[19:29:38] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db119[67] - https://phabricator.wikimedia.org/T313978 (RobH)
[19:29:48] <wikibugs>	 (CR) Cwhite: [C: +1] netmon: Add netmon1003 to scap's hieradata. [puppet] - https://gerrit.wikimedia.org/r/817890 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[19:29:57] <wikibugs>	 (CR) Andrea Denisse: [C: +2] netmon: Add netmon1003 to scap's hieradata. [puppet] - https://gerrit.wikimedia.org/r/817890 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[19:32:33] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (RobH)
[19:33:20] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wikimediacloud.org: use cloudcontrol1006 as the primary openstack endpoint [dns] - https://gerrit.wikimedia.org/r/817878 (https://phabricator.wikimedia.org/T313268) (owner: Andrew Bogott)
[19:33:36] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (RobH)
[19:34:09] <logmsgbot>	 !log denisse@deploy1002 Started deploy [librenms/librenms@f049593]: Provision LibreNMS on netmon1003
[19:34:14] <logmsgbot>	 !log denisse@deploy1002 Finished deploy [librenms/librenms@f049593]: Provision LibreNMS on netmon1003 (duration: 00m 05s)
[19:44:34] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @Jclark-ctr when you have time, can you plug on :  ### CR1-eqiad to front of the patch panel port 0/1: 1 breakout cable and the first break out cable go...
[19:45:03] <jinxer-wm>	 (PuppetFailure) resolved: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:47:29] <wikibugs>	 (PS1) Bartosz Dziewoński: VisualEditor: Allow external link paste on mediawikiwiki, metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/817893 (https://phabricator.wikimedia.org/T129546)
[19:48:18] <wikibugs>	 (CR) Dzahn: [C: +1] "I can confirm these are the servers from the decom ticket, it matches netbox, this change looks correct. You will have to ignore the warni" [puppet] - https://gerrit.wikimedia.org/r/817869 (https://phabricator.wikimedia.org/T313730) (owner: RLazarus)
[19:49:05] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:51:51] <icinga-wm>	 ACKNOWLEDGEMENT - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating Andrew Bogott This may be due to my moving the openstack.eqiad1.wikimediacloud.org endpoint, investigating. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:53:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (RobH)
[19:53:56] <wikibugs>	 (PS1) Bartosz Dziewoński: jquery.textSelection: Support more edge cases of document.execCommand [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817847 (https://phabricator.wikimedia.org/T33780)
[19:54:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (RobH)
[19:54:37] <wikibugs>	 (PS1) Bartosz Dziewoński: jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817848 (https://phabricator.wikimedia.org/T33780)
[19:55:08] <wikibugs>	 (PS1) Bartosz Dziewoński: jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817849 (https://phabricator.wikimedia.org/T33780)
[19:56:52] <wikibugs>	 (PS1) Bartosz Dziewoński: Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817850 (https://phabricator.wikimedia.org/T33780)
[19:57:01] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:57:02] <wikibugs>	 (PS1) Bartosz Dziewoński: Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817851 (https://phabricator.wikimedia.org/T33780)
[19:58:31] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T2000).
[20:00:05] <jouncebot>	 koi and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:14] <koi>	 hi
[20:00:47] <wikibugs>	 (PS14) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389)
[20:01:11] <MatmaRex>	 hello
[20:01:19] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:04:25] <mutante>	 jouncebot: now
[20:04:26] <jouncebot>	 For the next 0 hour(s) and 55 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220727T2000)
[20:05:42] <cjming>	 hi - i can deploy
[20:05:49] <cjming>	 sorry for being late
[20:05:59] <wikibugs>	 (PS2) Clare Ming: ptwiki: Restrict "move" permission [mediawiki-config] - https://gerrit.wikimedia.org/r/817373 (https://phabricator.wikimedia.org/T313802) (owner: Stang)
[20:07:03] <wikibugs>	 (CR) Clare Ming: [C: +2] ptwiki: Restrict "move" permission [mediawiki-config] - https://gerrit.wikimedia.org/r/817373 (https://phabricator.wikimedia.org/T313802) (owner: Stang)
[20:07:31] <wikibugs>	 (CR) CI reject: [V: -1] elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: Ebernhardson)
[20:08:39] <wikibugs>	 (Merged) jenkins-bot: ptwiki: Restrict "move" permission [mediawiki-config] - https://gerrit.wikimedia.org/r/817373 (https://phabricator.wikimedia.org/T313802) (owner: Stang)
[20:09:21] <cjming>	 koi: ur patch is up on mwdebug1002 - can you check?
[20:10:04] <wikibugs>	 (PS2) Clare Ming: VisualEditor: Allow external link paste on mediawikiwiki, metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/817893 (https://phabricator.wikimedia.org/T129546) (owner: Bartosz Dziewoński)
[20:10:12] <koi>	 cjming: tested and LGTM
[20:10:19] <cjming>	 syncing!
[20:11:52] <wikibugs>	 (CR) Clare Ming: [C: +2] VisualEditor: Allow external link paste on mediawikiwiki, metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/817893 (https://phabricator.wikimedia.org/T129546) (owner: Bartosz Dziewoński)
[20:12:43] <MatmaRex>	 cjming: thanks for deploying. my backports can be synced in any order, or all at once. feel free to +2 them ahead of time because i have a lot of them :(
[20:13:04] <cjming>	 MatmaRex: sounds good
[20:13:50] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817373|ptwiki: Restrict "move" permission (T313802)]] (duration: 03m 19s)
[20:13:54] <stashbot>	 T313802: Modify 'move' permissions on ptwiki - https://phabricator.wikimedia.org/T313802
[20:14:05] <wikibugs>	 (Merged) jenkins-bot: VisualEditor: Allow external link paste on mediawikiwiki, metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/817893 (https://phabricator.wikimedia.org/T129546) (owner: Bartosz Dziewoński)
[20:14:37] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:15:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:15:15] <cjming>	 MatmaRex: 1st config patch on mwdebug1002 if you can test
[20:15:33] <MatmaRex>	 yeah. looking
[20:16:23] <wikibugs>	 (PS15) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389)
[20:16:25] <wikibugs>	 (PS1) Brennen Bearnes: SearchTranslationsApi: Change the way we fetch TTM services [extensions/Translate] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817855 (https://phabricator.wikimedia.org/T313836)
[20:16:36] <wikibugs>	 (CR) Clare Ming: [C: +2] jquery.textSelection: Support more edge cases of document.execCommand [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817847 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[20:16:46] <MatmaRex>	 cjming: looks good
[20:16:54] <wikibugs>	 (CR) Clare Ming: [C: +2] jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817848 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[20:17:05] <cjming>	 great - syncing config patch
[20:17:26] <brennen>	 cjming: mind pinging me when you're wrapping up?  i'll probably have a backport to go out then and roll the train forward to group1.
[20:17:38] <cjming>	 brennen: sure thing
[20:17:43] <brennen>	 thx!
[20:17:44] <wikibugs>	 (PS3) Dzahn: gerrit: turn gerrit2002 into a gerrit migration dest host [puppet] - https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250)
[20:19:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:19:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:20:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:20:35] <wikibugs>	 (PS1) Andrew Bogott: OpenStack nova: pregenerate db grants for cell 0 [puppet] - https://gerrit.wikimedia.org/r/817896
[20:20:43] <MatmaRex>	 cjming: actually, if possible, could we do all of the backports at once? they all affect the same feature so it doesn't make much sense to test each individually
[20:20:55] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817893|VisualEditor: Allow external link paste on mediawikiwiki, metawiki (T129546)]] (duration: 03m 37s)
[20:21:00] <stashbot>	 T129546: Support preserving external links in pasted HTML content - https://phabricator.wikimedia.org/T129546
[20:21:24] <wikibugs>	 (CR) CI reject: [V: -1] OpenStack nova: pregenerate db grants for cell 0 [puppet] - https://gerrit.wikimedia.org/r/817896 (owner: Andrew Bogott)
[20:22:17] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:22:18] <cjming>	 MatmaRex: i think so - i'm unclear if i have to wait for things to merge on the release branches before rebasing
[20:23:12] <MatmaRex>	 cjming: i just created them, so they should merge cleanly without the need for rebasing
[20:23:47] <cjming>	 MatmaRex: so you're saying just merge them all at once
[20:23:59] <MatmaRex>	 yeah
[20:24:07] <cjming>	 alrighty
[20:24:19] <wikibugs>	 (CR) Clare Ming: [C: +2] jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817849 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[20:24:24] <MatmaRex>	 if you +2 them all, they all should start the tests now, and merge whenever that finishes
[20:24:28] <wikibugs>	 (PS2) Andrew Bogott: OpenStack nova: pregenerate db grants for cell 0 [puppet] - https://gerrit.wikimedia.org/r/817896
[20:24:48] <cjming>	 yup
[20:24:53] <wikibugs>	 (CR) Clare Ming: [C: +2] Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817850 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[20:25:02] <wikibugs>	 (CR) Clare Ming: [C: +2] Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817851 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[20:25:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:26:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:26:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:27:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:27:31] <wikibugs>	 (CR) CI reject: [V: -1] elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: Ebernhardson)
[20:28:47] <wikibugs>	 (CR) Andrew Bogott: [C: +2] OpenStack nova: pregenerate db grants for cell 0 [puppet] - https://gerrit.wikimedia.org/r/817896 (owner: Andrew Bogott)
[20:32:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:32:17] <wikibugs>	 (PS1) Andrew Bogott: Openstack Nova: typo fix for cell db name [puppet] - https://gerrit.wikimedia.org/r/817899
[20:34:59] <wikibugs>	 (Merged) jenkins-bot: jquery.textSelection: Support more edge cases of document.execCommand [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817847 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[20:35:39] <wikibugs>	 (Merged) jenkins-bot: jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817848 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[20:35:40] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Openstack Nova: typo fix for cell db name [puppet] - https://gerrit.wikimedia.org/r/817899 (owner: Andrew Bogott)
[20:37:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:37:08] <wikibugs>	 (PS2) Zabe: wikimedia.org: Add developers.wikimedia.org alias [dns] - https://gerrit.wikimedia.org/r/816174 (https://phabricator.wikimedia.org/T313597)
[20:38:07] <wikibugs>	 (PS16) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389)
[20:39:32] <cjming>	 MatmaRex: 817847 + 817848 are up on mwdebug1002 if you want to check since we're still waiting for your other stuff to merge
[20:40:08] <wikibugs>	 (PS17) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389)
[20:40:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:40:40] <MatmaRex>	 thanks, i'd have to recheck with the other patches anyway, so i'll just wait if that's okay?
[20:40:47] <cjming>	 np
[20:41:17] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.2-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:41:29] <icinga-wm>	 PROBLEM - Apache HTTP on mw1422 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:29] <icinga-wm>	 PROBLEM - Apache HTTP on mw1421 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:29] <icinga-wm>	 PROBLEM - Apache HTTP on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:29] <icinga-wm>	 PROBLEM - Apache HTTP on mw1447 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:34] <jinxer-wm>	 (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:41:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:41:35] <TheresNoTime>	 mhm
[20:41:39] <icinga-wm>	 PROBLEM - Apache HTTP on mw1428 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:39] <sukhe>	 acked
[20:41:41] <sukhe>	 here
[20:41:43] <icinga-wm>	 PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:45] <icinga-wm>	 PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:47] <icinga-wm>	 PROBLEM - Apache HTTP on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:47] <icinga-wm>	 PROBLEM - Apache HTTP on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:47] <icinga-wm>	 PROBLEM - Apache HTTP on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:55] <icinga-wm>	 PROBLEM - Apache HTTP on mw1443 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:55] <icinga-wm>	 PROBLEM - Apache HTTP on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:57] <mutante>	 here, trying to ACK alerts but it's many
[20:41:57] <icinga-wm>	 PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:57] <icinga-wm>	 PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:57] <icinga-wm>	 PROBLEM - Apache HTTP on mw1425 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:57] <icinga-wm>	 PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:57] <icinga-wm>	 PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:58] <icinga-wm>	 PROBLEM - Apache HTTP on mw1402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:59] <icinga-wm>	 PROBLEM - Apache HTTP on mw1404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:59] <icinga-wm>	 PROBLEM - Apache HTTP on mw1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:41:59] <icinga-wm>	 PROBLEM - Apache HTTP on mw1427 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:00] <icinga-wm>	 PROBLEM - Apache HTTP on mw1424 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:00] <icinga-wm>	 PROBLEM - Apache HTTP on mw1408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:01] <icinga-wm>	 PROBLEM - Apache HTTP on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:04] <mutante>	 this is the middle of deployment?
[20:42:07] <TheresNoTime>	 cjming: may want to hold your deploy?
[20:42:13] <cjming>	 gah
[20:42:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:42:16] <cjming>	 ok
[20:42:23] <RhinosF1>	 mutante: yes
[20:42:25] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POS
[20:42:30] <sukhe>	 ah hm
[20:42:33] <icinga-wm>	 PROBLEM - Apache HTTP on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:33] <icinga-wm>	 PROBLEM - Apache HTTP on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:33] <icinga-wm>	 PROBLEM - Apache HTTP on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:33] <icinga-wm>	 PROBLEM - Apache HTTP on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:39] <icinga-wm>	 PROBLEM - Apache HTTP on mw1426 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:41] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:42:41] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 1189 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:42:43] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:42:43] <icinga-wm>	 PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou
[20:42:43] <icinga-wm>	  nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[20:42:45] <icinga-wm>	 PROBLEM - Apache HTTP on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:45] <icinga-wm>	 PROBLEM - Apache HTTP on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:47] <icinga-wm>	 PROBLEM - Apache HTTP on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:47] <icinga-wm>	 PROBLEM - Apache HTTP on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:47] <icinga-wm>	 PROBLEM - Apache HTTP on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:47] <RhinosF1>	 sukhe, mutante: maybe move to -sre, less noise
[20:42:49] <icinga-wm>	 PROBLEM - Apache HTTP on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:51] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 1941 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:42:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1449 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:42:53] <sukhe>	 right thanks
[20:42:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[20:43:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:43:03] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adap
[20:43:03] <icinga-wm>	 nks to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[20:43:05] <icinga-wm>	 PROBLEM - Apache HTTP on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:43:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:43:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:43:13] <icinga-wm>	 PROBLEM - Apache HTTP on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:43:13] <icinga-wm>	 PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:43:13] <icinga-wm>	 PROBLEM - Apache HTTP on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:43:15] <icinga-wm>	 PROBLEM - Apache HTTP on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:43:17] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} (
[20:43:17] <icinga-wm>	 a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[20:43:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1450 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:43:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:43:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:43:18] <cjming>	 can someone lmk if/when it's ok to continue?
[20:43:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:43:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:43:23] <icinga-wm>	 PROBLEM - Apache HTTP on mw1448 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:43:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:43:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:43:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:43:33] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (exp
[20:43:33] <icinga-wm>	 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpe
[20:43:33] <icinga-wm>	 tus 503 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:43:36] <RhinosF1>	 cjming: join #wikimedia-sre
[20:43:37] <icinga-wm>	 PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[20:43:39] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[20:43:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:43:43] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for Apr
[20:43:43] <icinga-wm>	 016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{dom
[20:43:43] <icinga-wm>	 page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[20:43:45] <wikibugs>	 (Merged) jenkins-bot: jquery.textSelection: Use non-execCommand when we can't focus the field [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817849 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[20:43:47] <wikibugs>	 (Merged) jenkins-bot: Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/817850 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[20:43:49] <wikibugs>	 (Merged) jenkins-bot: Delay template insertion until after closing the dialog [extensions/TemplateWizard] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817851 (https://phabricator.wikimedia.org/T33780) (owner: Bartosz Dziewoński)
[20:43:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:43:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:43:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:43:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:43:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1447.eqiad.wmnet, mw1361.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1428.eqiad.wmnet, mw1348.eqiad.wmnet, mw1314.eqiad.wmnet, mw1412.eqiad.wmnet, mw1378.eqiad.wmnet, mw1404.eqiad.wmnet, mw1362.eqiad.wmnet, mw1381.eqiad.wmnet, mw1388.eqiad.wmnet, mw1340.eq
[20:44:03] <icinga-wm>	 t, mw1449.eqiad.wmnet, mw1443.eqiad.wmnet, mw1343.eqiad.wmnet, mw1421.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1396.eqiad.wmnet, mw1358.eqiad.wmnet, mw1424.eqiad.wmnet, mw1398.eqiad.wmnet, mw1444.eqiad.wmnet, mw1376.eqiad.wmnet, mw1359.eqiad.wmnet, mw1423.eqiad.wmnet, mw1317.eqiad.wmnet, mw1450.eqiad.wmnet, mw1425.eqiad.wmnet, mw1408.eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1312.eqiad.wmnet,
[20:44:03] <icinga-wm>	 eqiad.wmnet, mw1383.eqiad.wmnet, mw1427.eqiad.wmnet, mw1406.eqiad.wmnet, mw1375.eqiad.wmnet, mw1342.eqiad.wmnet, mw1382.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmnet, mw1402.eqiad https://wikitech.wikimedia.org/wiki/PyBal
[20:44:05] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:44:07] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:07] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): 
[20:44:07] <icinga-wm>	 }/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpecte
[20:44:07] <icinga-wm>	  503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[20:44:07] <icinga-wm>	 PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou
[20:44:07] <icinga-wm>	  nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Proton
[20:44:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:44:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:29] <icinga-wm>	 PROBLEM - Apache HTTP on mw1423 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:44:35] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1422.eqiad.wmnet, mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1448.eqiad.wmnet, mw1447.eqiad.wmnet, mw1394.eqiad.wmnet, mw1361.eqiad.wmnet, mw1392.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1428.eqiad.wmnet, mw1388.eqiad.wmnet, mw1314.eqiad.wmnet, mw1386.eqiad.wmnet, mw1348.eqiad.wmnet, mw1396.eq
[20:44:35] <icinga-wm>	 t, mw1390.eqiad.wmnet, mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, mw1340.eqiad.wmnet, mw1449.eqiad.wmnet, mw1443.eqiad.wmnet, mw1343.eqiad.wmnet, mw1421.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1358.eqiad.wmnet, mw1359.eqiad.wmnet, mw1424.eqiad.wmnet, mw1412.eqiad.wmnet, mw1398.eqiad.wmnet, mw1444.eqiad.wmnet, mw1404.eqiad.wmnet, mw1376.eqiad.wmnet, mw1363.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet,
[20:44:35] <icinga-wm>	 eqiad.wmnet, mw1450.eqiad.wmnet, mw1425.eqiad.wmnet, mw1408.eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1312.eqiad.wmnet, mw1400.eqiad.wmnet, mw1402.eqiad.wmnet, mw1383.eqiad https://wikitech.wikimedia.org/wiki/PyBal
[20:44:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:47] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:44:47] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response 
[20:44:47] <icinga-wm>	 ived: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page
[20:44:47] <icinga-wm>	 TICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:44:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:53] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:44:53] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:44:57] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:44:57] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:44:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:59] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:03] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[20:45:05] <icinga-wm>	 PROBLEM - Apache HTTP on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:45:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:09] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1447 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:09] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:45:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:15] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:15] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:17] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[20:45:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:45:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:45:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:45:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:45:18] <icinga-wm>	 PROBLEM - Apache HTTP on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:45:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:19] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:19] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:20] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:21] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:23] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:25] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:25] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:25] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:25] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:25] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:26] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:26] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1449 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:31] <wikibugs>	 (03CR) 10Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/817877 (owner: 10Andrew Bogott)
[20:45:31] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1426 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:31] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1424 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:31] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1443 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:31] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:33] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:33] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:41] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:45:45] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1427 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:01] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:01] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:03] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:09] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1448 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:14] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:46:17] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:21] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1428 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:46:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:46:21] <Tamzin>	 I got 503 problems
[20:46:25] <Tamzin>	 And a 503 is one
[20:46:27] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:27] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:31] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:35] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:46:37] <wikibugs>	 (03CR) 10Aaron Schulz: [C: 03+1] Multi-DC routing special cases for OAuth [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[20:46:43] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:43] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:43] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1423 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:43] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1422 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:43] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:45] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 9.276 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:46:49] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:49] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:49] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:49] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:50] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:50] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:46:53] <icinga-wm>	 RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 8.128 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:47:59] <wikibugs>	 (03CR) 10Bking: [V: 03+2 C: 03+2] Set CORS headers appropriate to WCQS [puppet] - 10https://gerrit.wikimedia.org/r/817387 (https://phabricator.wikimedia.org/T307391) (owner: 10Ebernhardson)
[20:48:06] <logmsgbot>	 !log sukhe@cumin1001 dbctl commit (dc=all): 'depool db1132', diff saved to https://phabricator.wikimedia.org/P32017 and previous config saved to /var/cache/conftool/dbconfig/20220727-204806-sukhe.json
[20:48:21] <icinga-wm>	 RECOVERY - Apache HTTP on mw1448 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 9.438 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:48:27] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 8.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:29] <icinga-wm>	 PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[20:48:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:31] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:31] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:31] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1448 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 5.370 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:43] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[20:48:43] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1428 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 6.408 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:45] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1447 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 3.678 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:48:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1421 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 3.963 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:48:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1422 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.686 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:48:49] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 5.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:48:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:48:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:48:51] <icinga-wm>	 RECOVERY - Apache HTTP on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 7.204 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:48:53] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1404 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 7.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:53] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 7.888 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:53] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1359 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 5.529 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:55] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 8.620 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:48:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:48:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:48:57] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:48:57] <icinga-wm>	 RECOVERY - Apache HTTP on mw1428 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 3.958 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:48:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:01] <icinga-wm>	 RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[20:49:01] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1423 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.948 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:03] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:49:03] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1402 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:03] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1422 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.360 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:03] <icinga-wm>	 RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.842 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:03] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1408 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.971 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:03] <icinga-wm>	 RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.672 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:05] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 3.780 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:05] <icinga-wm>	 RECOVERY - Apache HTTP on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.469 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:05] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1444 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.160 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:07] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:07] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:07] <icinga-wm>	 RECOVERY - Apache HTTP on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 2.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:07] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:07] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1406 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:08] <icinga-wm>	 RECOVERY - Apache HTTP on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 2.577 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:08] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.785 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:09] <icinga-wm>	 RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.711 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:09] <icinga-wm>	 RECOVERY - Apache HTTP on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:10] <icinga-wm>	 RECOVERY - Apache HTTP on mw1406 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:10] <icinga-wm>	 RECOVERY - Apache HTTP on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.836 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:11] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1360 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 3.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:13] <icinga-wm>	 RECOVERY - Apache HTTP on mw1443 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:13] <icinga-wm>	 RECOVERY - Apache HTTP on mw1444 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.694 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:13] <icinga-wm>	 RECOVERY - Apache HTTP on mw1425 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:13] <icinga-wm>	 RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.349 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:49:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1402 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.020 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:17] <icinga-wm>	 RECOVERY - Apache HTTP on mw1408 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:17] <icinga-wm>	 RECOVERY - Apache HTTP on mw1424 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:17] <icinga-wm>	 RECOVERY - Apache HTTP on mw1423 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:17] <icinga-wm>	 RECOVERY - Apache HTTP on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:17] <icinga-wm>	 RECOVERY - Apache HTTP on mw1404 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.505 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:17] <icinga-wm>	 RECOVERY - Apache HTTP on mw1412 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.835 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:18] <icinga-wm>	 RECOVERY - Apache HTTP on mw1427 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.852 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:19] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] "Insofar as the impact this might have on storage, +1 from me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/816059 (https://phabricator.wikimedia.org/T313496) (owner: 10Tim Starling)
[20:49:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:49:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:39] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:49:39] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:49:41] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:41] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:47] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1412 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.566 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:49] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:49:49] <icinga-wm>	 RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:49:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:50] <icinga-wm>	 RECOVERY - Apache HTTP on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:55] <icinga-wm>	 RECOVERY - Apache HTTP on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:57] <icinga-wm>	 RECOVERY - Apache HTTP on mw1426 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:49:57] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1447 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:49:57] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[20:50:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:50:03] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:50:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:50:03] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:50:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:03] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1340 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:03] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:06] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:50:06] <icinga-wm>	 RECOVERY - Apache HTTP on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:07] <icinga-wm>	 RECOVERY - Apache HTTP on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:07] <icinga-wm>	 RECOVERY - Apache HTTP on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:08] <icinga-wm>	 RECOVERY - Apache HTTP on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:08] <icinga-wm>	 RECOVERY - Apache HTTP on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:09] <icinga-wm>	 RECOVERY - Apache HTTP on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:09] <icinga-wm>	 RECOVERY - Apache HTTP on mw1359 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:10] <icinga-wm>	 RECOVERY - Apache HTTP on mw1360 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.017 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:10] <icinga-wm>	 RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:12] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:13] <icinga-wm>	 RECOVERY - Apache HTTP on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:13] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:14] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:14] <icinga-wm>	 RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[20:50:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:15] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1343 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:16] <icinga-wm>	 RECOVERY - Apache HTTP on mw1449 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:16] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:17] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.374 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:17] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:17] <wikibugs>	 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) We just had another issue and db1132 (10.6) was the only one affected again. I will scan thru slow queries tomorrow EU...
[20:50:18] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:18] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:50:19] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1449 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:19] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:20] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:20] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:50:21] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 1.427 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:21] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:22] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1426 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:22] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1443 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:23] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1424 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:23] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:24] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:24] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:29] <icinga-wm>	 RECOVERY - Apache HTTP on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.464 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:31] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:35] <icinga-wm>	 RECOVERY - Apache HTTP on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:35] <icinga-wm>	 RECOVERY - Apache HTTP on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:35] <icinga-wm>	 RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 1.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:37] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1427 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.147 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:37] <icinga-wm>	 RECOVERY - Apache HTTP on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:39] <icinga-wm>	 RECOVERY - Apache HTTP on mw1450 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:39] <icinga-wm>	 RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:39] <icinga-wm>	 RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:41] <icinga-wm>	 RECOVERY - Apache HTTP on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:43] <icinga-wm>	 RECOVERY - Apache HTTP on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:50:43] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:50:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:50:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:53] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1316 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:53] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:50:55] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:55] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:55] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:50:57] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:51:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:51:01] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:51:01] <icinga-wm>	 RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:51:07] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[20:51:07] <icinga-wm>	 RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[20:51:07] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:51:09] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[20:51:09] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:51:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:51:12] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[20:51:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:51:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:51:23] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.2-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:51:24] <wikibugs>	 10SRE, 10DBA, 10MediaWiki-Action-API, 10Traffic, and 2 others: API not responding (overflow) - https://phabricator.wikimedia.org/T313986 (10RhinosF1)
[20:51:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:51:31] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[20:51:34] <jinxer-wm>	 (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:51:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:51:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:51:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[20:52:19] <wikibugs>	 10SRE, 10DBA, 10MediaWiki-Action-API, 10Traffic, and 2 others: API not responding (overflow) - https://phabricator.wikimedia.org/T313986 (10RhinosF1) {T311106}
[20:52:23] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[20:52:39] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:52:39] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:52:49] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:55:25] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[20:55:37] <logmsgbot>	 !log sukhe@cumin1001 dbctl commit (dc=all): 'depool db1111', diff saved to https://phabricator.wikimedia.org/P32018 and previous config saved to /var/cache/conftool/dbconfig/20220727-205536-sukhe.json
[20:55:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:55:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[20:56:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:56:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:56:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[20:56:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:58:06] <wikibugs>	 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) db1132 and db1111 depooled
[20:58:54] <wikibugs>	 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-Action-API, 10Traffic, and 2 others: API not responding (overflow) - https://phabricator.wikimedia.org/T313986 (10Marostegui) Not sure what's expected from the DBAs here. There's a chain of things that got db1132 overloaded.
[20:59:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[21:00:21] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[21:00:37] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[21:02:35] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[21:02:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:03:25] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[21:04:12] <wikibugs>	 SRE: Upload shiny-server .deb to our Buster apt repository - https://phabricator.wikimedia.org/T313989 (mpopov)
[21:05:36] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Python3-Porting: Migrate get-raid-status-megacli to Python3 - https://phabricator.wikimedia.org/T313952 (Peachey88)
[21:05:51] <cjming>	 MatmaRex: all your patches should be up on mwdebug1002 -- can you verify?
[21:06:20] <MatmaRex_>	 sorry, my computer crashed, hope i didn't miss anything cjming
[21:07:00] <cjming>	 MatmaRex: I was just saying all your patches are on mwdebug1002 - can you check?
[21:07:17] <MatmaRex_>	 looking. thanks
[21:09:21] <MatmaRex_>	 cjming: looks good on wmf.21
[21:10:08] <MatmaRex_>	 amf on wmf.22 too
[21:10:12] <MatmaRex_>	 everything looks fine
[21:10:16] <MatmaRex_>	 and*
[21:12:40] <cjming>	 cool - syncing them all
[21:14:40] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (Andrew) 185.15.57.12/30 should be enough, so let's start with that.  With luck that'll be all we need, and we can leave it as a permanent change.  In t...
[21:17:41] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.39.0-wmf.22/resources/src/jquery/jquery.textSelection.js: Backport: [[gerrit:817847|jquery.textSelection: Support more edge cases of document.execCommand (T33780)]] (duration: 03m 10s)
[21:18:18] <stashbot>	 T33780: WikiEditor dialogs kill the undo buffer - https://phabricator.wikimedia.org/T33780
[21:21:12] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.39.0-wmf.21/resources/src/jquery/jquery.textSelection.js: Backport: [[gerrit:817848|jquery.textSelection: Use non-execCommand when we can't focus the field (T33780)]] (duration: 03m 09s)
[21:25:02] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.39.0-wmf.22/resources/src/jquery/jquery.textSelection.js: Backport: [[gerrit:817849|jquery.textSelection: Use non-execCommand when we can't focus the field (T33780)]] (duration: 03m 22s)
[21:25:07] <stashbot>	 T33780: WikiEditor dialogs kill the undo buffer - https://phabricator.wikimedia.org/T33780
[21:27:19] <urandom>	 !log Removing reserved space on sessionstore storage volumes -- T313991
[21:27:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:23] <stashbot>	 T313991: Investigate sessionstore Cassandra utilization improvements - https://phabricator.wikimedia.org/T313991
[21:28:56] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/TemplateWizard/resources/ext.TemplateWizard.Dialog.js: Backport: [[gerrit:817850|Delay template insertion until after closing the dialog (T33780)]] (duration: 03m 27s)
[21:29:29] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[21:29:32] <wikibugs>	 (PS1) Bearloga: shiny_server: Minimal dependencies [puppet] - https://gerrit.wikimedia.org/r/817903
[21:31:55] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[21:32:47] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/TemplateWizard/resources/ext.TemplateWizard.Dialog.js: Backport: [[gerrit:817851|Delay template insertion until after closing the dialog (T33780)]] (duration: 03m 36s)
[21:32:53] <stashbot>	 T33780: WikiEditor dialogs kill the undo buffer - https://phabricator.wikimedia.org/T33780
[21:33:08] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] SearchTranslationsApi: Change the way we fetch TTM services [extensions/Translate] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817855 (https://phabricator.wikimedia.org/T313836) (owner: Brennen Bearnes)
[21:33:10] <cjming>	 MatmaRex: all your changes should be live!
[21:33:26] <MatmaRex_>	 thanks!
[21:33:26] <cjming>	 brennen: all yours
[21:33:43] <cjming>	 !log end of UTC late backport window
[21:33:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:53] <brennen>	 cjming: thanks - going to sync the above Translate patch and train -> group1.
[21:54:04] <wikibugs>	 (Merged) jenkins-bot: SearchTranslationsApi: Change the way we fetch TTM services [extensions/Translate] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/817855 (https://phabricator.wikimedia.org/T313836) (owner: Brennen Bearnes)
[21:58:29] <icinga-wm>	 PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:59:03] <logmsgbot>	 !log brennen@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/Translate/src/TtmServer: Backport: [[gerrit:817855|SearchTranslationsApi: Change the way we fetch TTM services (T313836)]] (duration: 03m 19s)
[21:59:09] <stashbot>	 T313836: MediaWiki\Extension\Translate\TtmServer\ServiceCreationFailure: Unknown type for name 'Apertium': cxserver - https://phabricator.wikimedia.org/T313836
[21:59:13] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[22:00:40] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.39.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/817905 (https://phabricator.wikimedia.org/T308075)
[22:00:41] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.39.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/817905 (https://phabricator.wikimedia.org/T308075) (owner: TrainBranchBot)
[22:00:57] <icinga-wm>	 RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:01:43] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.39.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/817905 (https://phabricator.wikimedia.org/T308075) (owner: TrainBranchBot)
[22:02:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:03:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:03:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:04:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:04:53] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[22:05:43] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.22  refs T308075
[22:05:47] <stashbot>	 T308075: 1.39.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T308075
[22:08:47] <wikibugs>	 (CR) Neil P. Quinn-WMF: "Our new entries look right! I just commented about a better place to put them in the existing organization scheme." [mediawiki-config] - https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633) (owner: Sbisson)
[22:08:52] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.22  refs T308075 (duration: 03m 08s)
[22:09:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:10:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:10:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:11:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:15:17] <icinga-wm>	 PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:15:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:16:09] <wikibugs>	 (PS1) Bearloga: r_lang: Switch from devtools to remotes [puppet] - https://gerrit.wikimedia.org/r/817907
[22:26:49] <icinga-wm>	 ACKNOWLEDGEMENT - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project Andrew Bogott investigating! https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[22:26:49] <icinga-wm>	 ACKNOWLEDGEMENT - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack Andrew Bogott investigating! https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:32:15] <icinga-wm>	 RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:35:19] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[22:36:23] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (Jclark-ctr) @cmooney  curious if this is a switch config that needs to change  these are racked in 10g racks  but only use 1g ports  db1190 E1 35...
[22:37:21] <wikibugs>	 (PS1) RLazarus: requestctl: Add a missing f on an f-string [software/conftool] - https://gerrit.wikimedia.org/r/817910
[22:37:46] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Install NVMe SSDs into  moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (Jclark-ctr)  moss-be1001 and moss-be1002 have been installed  @LSobanski  do you have a thanos-be100? host and can we schedule installation
[22:49:55] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:54:01] <wikibugs>	 (CR) Dzahn: [V: -1] "did not find a value for the name 'profile::gerrit::migration::src_host" [puppet] - https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[22:58:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[22:59:36] <mutante>	 denisse|m: ^ that'll need a "sudo keyholder arm" on the new host
[23:00:11] <wikibugs>	 (CR) Tim Starling: [C: +2] Increase core session expiry to 86400 to match CentralAuth [deployment-charts] - https://gerrit.wikimedia.org/r/816059 (https://phabricator.wikimedia.org/T313496) (owner: Tim Starling)
[23:00:15] <mutante>	 and then it will ask for a passphrase to load the key (https://wikitech.wikimedia.org/wiki/Keyholder#Production_passphrases)
[23:02:30] <denisse|m>	 mutante: Thanks for the heads-up!! I was about to go to have lunch. Do you think this change could wait for about an hour?? If not, I could do it right now. 🙈
[23:03:02] <mutante>	 denisse|m: it can definitely wait an hour
[23:03:18] <mutante>	 go for lunch, it's late :)
[23:03:42] <denisse|m>	 Okay, cool. I'll go for lunch and do that once I'm back. Thanks. :)
[23:04:36] <wikibugs>	 (Merged) jenkins-bot: Increase core session expiry to 86400 to match CentralAuth [deployment-charts] - https://gerrit.wikimedia.org/r/816059 (https://phabricator.wikimedia.org/T313496) (owner: Tim Starling)
[23:05:55] <wikibugs>	 (PS4) Dzahn: gerrit: turn gerrit2002 into a gerrit migration dest host [puppet] - https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250)
[23:08:03] <logmsgbot>	 !log tstarling@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[23:08:13] <logmsgbot>	 !log tstarling@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[23:11:52] <rzl>	 jouncebot: nowandnext
[23:11:52] <jouncebot>	 No deployments scheduled for the next 6 hour(s) and 48 minute(s)
[23:11:52] <jouncebot>	 In 6 hour(s) and 48 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T0600)
[23:12:33] <rzl>	 starting a decom that'll involve a scap proxy, any unscheduled deploys let me know so we can avoid a race condition :)
[23:13:45] <logmsgbot>	 !log rzl@cumin2002 conftool action : set/pooled=no; selector: name=mw225[1-57-8].codfw.wmnet
[23:13:51] <logmsgbot>	 !log tstarling@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: apply
[23:14:17] <logmsgbot>	 !log tstarling@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply
[23:14:41] <logmsgbot>	 !log tstarling@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply
[23:14:58] <logmsgbot>	 !log tstarling@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply
[23:17:33] <logmsgbot>	 !log rzl@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 7 hosts with reason: Decom
[23:17:46] <logmsgbot>	 !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 7 hosts with reason: Decom
[23:18:47] <logmsgbot>	 !log rzl@cumin2002 conftool action : set/pooled=inactive; selector: name=mw225[1-57-8].codfw.wmnet
[23:22:59] <wikibugs>	 (PS2) Tim Starling: Increase $wgObjectCacheSessionExpiry to 86400 [mediawiki-config] - https://gerrit.wikimedia.org/r/816060 (https://phabricator.wikimedia.org/T313496)
[23:23:07] <wikibugs>	 (CR) Tim Starling: [C: +2] Increase $wgObjectCacheSessionExpiry to 86400 [mediawiki-config] - https://gerrit.wikimedia.org/r/816060 (https://phabricator.wikimedia.org/T313496) (owner: Tim Starling)
[23:24:35] <wikibugs>	 (Merged) jenkins-bot: Increase $wgObjectCacheSessionExpiry to 86400 [mediawiki-config] - https://gerrit.wikimedia.org/r/816060 (https://phabricator.wikimedia.org/T313496) (owner: Tim Starling)
[23:26:06] <logmsgbot>	 !log rzl@cumin2002 START - Cookbook sre.hosts.decommission for hosts mw[2251-2255,2257-2258].codfw.wmnet
[23:27:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:28:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:28:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:29:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:29:18] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: increase wgObjectCacheSessionExpiry to 86400 (duration: 03m 30s)
[23:34:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:35:07] <wikibugs>	 (PS2) Tim Starling: Move CentralAuth sessions to Kask [mediawiki-config] - https://gerrit.wikimedia.org/r/816061 (https://phabricator.wikimedia.org/T313496)
[23:35:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:35:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:36:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:36:49] <wikibugs>	 (CR) Tim Starling: [C: +2] Move CentralAuth sessions to Kask [mediawiki-config] - https://gerrit.wikimedia.org/r/816061 (https://phabricator.wikimedia.org/T313496) (owner: Tim Starling)
[23:37:45] <wikibugs>	 (Merged) jenkins-bot: Move CentralAuth sessions to Kask [mediawiki-config] - https://gerrit.wikimedia.org/r/816061 (https://phabricator.wikimedia.org/T313496) (owner: Tim Starling)
[23:38:23] <logmsgbot>	 !log rzl@cumin2002 START - Cookbook sre.dns.netbox
[23:41:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:42:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:42:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:43:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:43:58] <wikibugs>	 (PS1) Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/817914 (https://phabricator.wikimedia.org/T313950)
[23:45:11] <logmsgbot>	 !log rzl@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:45:12] <logmsgbot>	 !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2251-2255,2257-2258].codfw.wmnet
[23:45:51] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: move CentralAuth sessions to Kask T313496 (duration: 05m 34s)
[23:45:55] <stashbot>	 T313496: Harmonize CentralAuth and core session TTL and migrate CentralAuth sessions to Kask - https://phabricator.wikimedia.org/T313496
[23:46:54] <wikibugs>	 (PS2) RLazarus: Decom mw2251-2255,2257,2258 [puppet] - https://gerrit.wikimedia.org/r/817869 (https://phabricator.wikimedia.org/T313730)
[23:47:01] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:47:11] <wikibugs>	 (Abandoned) Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/817914 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[23:47:15] <wikibugs>	 (PS1) Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950)
[23:48:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:48:56] <wikibugs>	 (CR) RLazarus: [C: +2] Decom mw2251-2255,2257,2258 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817869 (https://phabricator.wikimedia.org/T313730) (owner: RLazarus)
[23:49:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:49:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:49:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:59:08] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: sync again now that scap proxy list is fixed T313730 T313496 (duration: 03m 25s)
[23:59:15] <stashbot>	 T313496: Harmonize CentralAuth and core session TTL and migrate CentralAuth sessions to Kask - https://phabricator.wikimedia.org/T313496
[23:59:15] <stashbot>	 T313730: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730