[00:00:02] (03CR) 10Zabe: [C: 03+2] Use core's PoolCounterClient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881466 (https://phabricator.wikimedia.org/T327336) (owner: 10Zabe) [00:00:25] 10SRE, 10Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (10Lens0021) I'm waiting for this because HyperKitty v1.3.5 provides RSS feed. (https://gitlab.com/mailman/hyperkitty/-/merge_requests/302) [00:00:42] (03Merged) 10jenkins-bot: Use core's PoolCounterClient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881466 (https://phabricator.wikimedia.org/T327336) (owner: 10Zabe) [00:01:22] !log zabe@deploy1002 Started scap: Backport for [[gerrit:881466|Use core's PoolCounterClient (T327336)]] [00:01:26] T327336: Undeploy PoolCounter extension from wmf production - https://phabricator.wikimedia.org/T327336 [00:02:57] !log zabe@deploy1002 zabe: Backport for [[gerrit:881466|Use core's PoolCounterClient (T327336)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [00:14:10] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:881466|Use core's PoolCounterClient (T327336)]] (duration: 12m 47s) [00:14:14] T327336: Undeploy PoolCounter extension from wmf production - https://phabricator.wikimedia.org/T327336 [00:18:33] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1018.eqiad.wmnet [00:27:53] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1018.eqiad.wmnet [00:28:09] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1025.eqiad.wmnet [00:30:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:46] (03PS2) 10Zabe: Stop loading PoolCounter extension [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/881467 (https://phabricator.wikimedia.org/T327336) [00:34:29] (03PS2) 10Zabe: Remove PoolCounter from extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881468 (https://phabricator.wikimedia.org/T327336) [00:36:49] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1025.eqiad.wmnet [00:38:23] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1026.eqiad.wmnet [00:45:32] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1026.eqiad.wmnet [00:47:55] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1027.eqiad.wmnet [00:48:39] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10BCornwall) ` root@lvs1019:/home/brett# ipvsadm -Lnt 10.2.2.13:6533 # kartotherian Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight Act... [00:53:03] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10BCornwall) From the above it looks like: * ldap-ro and upload both somehow don't have any service established any more and, assuming I haven't done anything wrong, can be re... 
[00:54:45] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10BCornwall) a:03BCornwall [00:55:07] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1027.eqiad.wmnet [00:55:25] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1030.eqiad.wmnet [01:03:09] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1030.eqiad.wmnet [01:06:15] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1031.eqiad.wmnet [01:15:47] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:05] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1031.eqiad.wmnet [01:18:04] (03PS1) 10Andrea Denisse: centrallog: Sync centrallog1001 to centrallog1002 [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) [01:18:08] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1032.eqiad.wmnet [01:26:01] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1032.eqiad.wmnet [01:26:12] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1033.eqiad.wmnet [01:31:03] (03PS1) 10Andrea Denisse: rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) [01:36:02] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1033.eqiad.wmnet [01:44:54] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2013.codfw.wmnet [01:47:53] (03PS2) 10Andrea Denisse: rsyslog: Add centrallog1002 
as eqiad TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) [01:51:42] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2013.codfw.wmnet [01:52:01] (03PS1) 10Andrea Denisse: logstash: Add centrallog1002 as logsource for logstash tests [puppet] - 10https://gerrit.wikimedia.org/r/882762 (https://phabricator.wikimedia.org/T318778) [01:55:37] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2014.codfw.wmnet [02:03:07] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2014.codfw.wmnet [02:04:50] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2019.codfw.wmnet [02:07:47] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:49] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:16:20] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host restbase2019.codfw.wmnet [02:17:47] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:20] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2021.codfw.wmnet [02:22:47] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:01] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2021.codfw.wmnet [02:27:13] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2024.codfw.wmnet [02:35:56] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2024.codfw.wmnet [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T0300) [03:07:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.20 [core] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/882767 (https://phabricator.wikimedia.org/T325583) [03:07:57] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.20 [core] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/882767 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot) [03:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [03:23:17] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.20 [core] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/882767 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot) [03:28:17] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [03:29:49] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [03:33:01] 
PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 111 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:36:09] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:53:55] PROBLEM - dump of matomo in eqiad on backupmon1001 is CRITICAL: Last dump for matomo at eqiad (db1108) taken on 2023-01-24 03:26:49 is 270 MiB, but the previous one was 216 MiB, a change of +24.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T0400) [04:00:51] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:01:17] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882788 (https://phabricator.wikimedia.org/T325583) [04:01:19] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882788 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot) [04:01:54] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882788 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot) [04:02:19] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.20 refs T325583 [04:02:23] T325583: 1.40.0-wmf.20 
deployment blockers - https://phabricator.wikimedia.org/T325583 [04:55:21] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.20 refs T325583 (duration: 53m 01s) [04:55:25] T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583 [04:57:30] !log mwpresync@deploy1002 Pruned MediaWiki: 1.40.0-wmf.18 (duration: 02m 07s) [05:38:14] (03PS1) 10KartikMistry: Update cxserver to 2023-01-23-123356-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/882791 (https://phabricator.wikimedia.org/T129470) [05:58:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance [05:58:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance [05:58:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2107 (T322618)', diff saved to https://phabricator.wikimedia.org/P43268 and previous config saved to /var/cache/conftool/dbconfig/20230124-055816-ladsgroup.json [05:58:20] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [06:00:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T322618)', diff saved to https://phabricator.wikimedia.org/P43269 and previous config saved to /var/cache/conftool/dbconfig/20230124-060035-ladsgroup.json [06:01:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2118.codfw.wmnet with reason: Maintenance [06:01:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2118.codfw.wmnet with reason: Maintenance [06:01:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2118 (T322618)', diff saved to https://phabricator.wikimedia.org/P43270 and previous config saved to /var/cache/conftool/dbconfig/20230124-060129-ladsgroup.json [06:03:46] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T322618)', diff saved to https://phabricator.wikimedia.org/P43271 and previous config saved to /var/cache/conftool/dbconfig/20230124-060345-ladsgroup.json [06:03:49] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [06:15:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2107', diff saved to https://phabricator.wikimedia.org/P43272 and previous config saved to /var/cache/conftool/dbconfig/20230124-061541-ladsgroup.json [06:18:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118', diff saved to https://phabricator.wikimedia.org/P43273 and previous config saved to /var/cache/conftool/dbconfig/20230124-061852-ladsgroup.json [06:22:47] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:30:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2107', diff saved to https://phabricator.wikimedia.org/P43274 and previous config saved to /var/cache/conftool/dbconfig/20230124-063048-ladsgroup.json [06:33:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118', diff saved to https://phabricator.wikimedia.org/P43275 and previous config saved to /var/cache/conftool/dbconfig/20230124-063358-ladsgroup.json [06:45:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T322618)', diff saved to https://phabricator.wikimedia.org/P43276 and previous config saved to /var/cache/conftool/dbconfig/20230124-064554-ladsgroup.json [06:45:59] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - 
https://phabricator.wikimedia.org/T322618 [06:49:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T322618)', diff saved to https://phabricator.wikimedia.org/P43277 and previous config saved to /var/cache/conftool/dbconfig/20230124-064905-ladsgroup.json [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T0700) [07:00:05] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T0700). [07:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [07:14:00] (03CR) 10Ayounsi: [C: 03+1] logstash: Add PTR resolution to firewall logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880889 (https://phabricator.wikimedia.org/T327095) (owner: 10Ayounsi) [07:15:17] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:51] (03CR) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [07:34:34] (03CR) 10Ayounsi: [C: 03+1] centrallog1002: Add to eqiad anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/882724 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [07:40:41] (03CR) 10WMDE-Fisch: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/878853 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [07:40:52] (03CR) 10WMDE-Fisch: [C: 03+1] Remove Kartographer versioned mapdata flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878853 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [07:41:36] (03PS1) 10Marostegui: instances.yaml: Remove db1106 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/882996 (https://phabricator.wikimedia.org/T327616) [07:42:53] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1106 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/882996 (https://phabricator.wikimedia.org/T327616) (owner: 10Marostegui) [07:43:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1106 from dbctl T327616', diff saved to https://phabricator.wikimedia.org/P43278 and previous config saved to /var/cache/conftool/dbconfig/20230124-074323-marostegui.json [07:43:27] T327616: decommission db1106.eqiad.wmnet - https://phabricator.wikimedia.org/T327616 [07:48:45] (03PS4) 10WMDE-Fisch: Deprecate the EnableMapFrame feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875463 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [07:49:04] (03CR) 10WMDE-Fisch: [C: 03+1] "PS4: Trivial rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875463 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [07:49:43] (03CR) 10Muehlenhoff: durum: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [07:50:15] !log installing Linux 5.10.162 on Bullseye hosts [07:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:28] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1103 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/882769 (https://phabricator.wikimedia.org/T327738) [07:53:30] (03PS1) 10Bartosz Dziewoński: Add "Page Frame" to 
DiscussionTools beta feature on almost all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883098 (https://phabricator.wikimedia.org/T323727) [07:53:32] (03PS1) 10Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - 10https://gerrit.wikimedia.org/r/882770 (https://phabricator.wikimedia.org/T327738) [07:56:54] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2110 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/882771 (https://phabricator.wikimedia.org/T327739) [07:57:10] (03Abandoned) 10Marostegui: mariadb: Promote db1103 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/882769 (https://phabricator.wikimedia.org/T327738) (owner: 10Gerrit maintenance bot) [07:57:23] (03Abandoned) 10Marostegui: wmnet: Update x1-master alias [dns] - 10https://gerrit.wikimedia.org/r/882770 (https://phabricator.wikimedia.org/T327738) (owner: 10Gerrit maintenance bot) [07:58:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2110 with weight 0 T327739', diff saved to https://phabricator.wikimedia.org/P43279 and previous config saved to /var/cache/conftool/dbconfig/20230124-075824-root.json [07:58:29] T327739: Switchover s4 master (db2140 -> db2110) - https://phabricator.wikimedia.org/T327739 [07:58:36] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T327739 [07:59:11] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T327739 [07:59:41] (03PS2) 10KartikMistry: Content Translation: Add campaign for Wiki Loves Living Heritage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882266 (https://phabricator.wikimedia.org/T327587) [08:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T0800). 
[08:00:05] kart_, MatmaRex, and Dreamy_Jazz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:18] \o [08:00:45] * kart_ is here [08:01:31] hi [08:02:49] (03CR) 10Filippo Giunchedi: "LGTM, a suggestion inline" [puppet] - 10https://gerrit.wikimedia.org/r/882718 (https://phabricator.wikimedia.org/T313858) (owner: 10Andrea Denisse) [08:03:20] I will go ahead with my patch.. [08:04:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882266 (https://phabricator.wikimedia.org/T327587) (owner: 10KartikMistry) [08:04:58] (03Merged) 10jenkins-bot: Content Translation: Add campaign for Wiki Loves Living Heritage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882266 (https://phabricator.wikimedia.org/T327587) (owner: 10KartikMistry) [08:05:00] (03PS1) 10Elukey: role::kafka::jumbo::broker: update firewall rules for centrallog1001 [puppet] - 10https://gerrit.wikimedia.org/r/883100 [08:05:46] !log kartik@deploy1002 Started scap: Backport for [[gerrit:882266|Content Translation: Add campaign for Wiki Loves Living Heritage (T327587)]] [08:05:50] T327587: Request campaign tag for Wiki Loves Living Heritage in Content Translation - https://phabricator.wikimedia.org/T327587 [08:06:05] (03CR) 10Filippo Giunchedi: "You can more simply change the hosts in the existing quickdatacopy and set ensure => present, no need to keep eqiad/codfw IMHO" [puppet] - 10https://gerrit.wikimedia.org/r/882760 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [08:06:12] kart_: would you be able to deploy the other patches as well afterwards? i don't have deployment access, and i think neither does Dreamy_Jazz [08:06:23] I do not either :) [08:06:59] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/883100 (owner: 10Elukey) [08:07:08] (03PS2) 10Dreamy Jazz: Enable write new for CheckUserLog comment fields on testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882240 (https://phabricator.wikimedia.org/T233004) [08:07:17] (03CR) 10Marostegui: [C: 03+1] mariadb: Promote db2110 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/882771 (https://phabricator.wikimedia.org/T327739) (owner: 10Gerrit maintenance bot) [08:07:44] !log kartik@deploy1002 kartik: Backport for [[gerrit:882266|Content Translation: Add campaign for Wiki Loves Living Heritage (T327587)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [08:08:24] MatmaRex: sure. [08:08:35] Thanks [08:09:57] Also anyone around who has permissions to open Special:CheckUserLog on test wikis? [08:10:12] Will need someone who has that to test my config change [08:10:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2110 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/882771 (https://phabricator.wikimedia.org/T327739) (owner: 10Gerrit maintenance bot) [08:11:27] (03PS1) 10Marostegui: db2140: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883102 [08:11:41] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:49] (03CR) 10Marostegui: [C: 03+2] db2140: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883102 (owner: 10Marostegui) [08:13:04] (03PS2) 10KartikMistry: Add "Page Frame" to DiscussionTools beta feature on almost all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883098 (https://phabricator.wikimedia.org/T323727) (owner: 10Bartosz Dziewoński) [08:16:12] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:882266|Content Translation: Add campaign for Wiki Loves Living Heritage (T327587)]] 
(duration: 10m 25s) [08:16:16] T327587: Request campaign tag for Wiki Loves Living Heritage in Content Translation - https://phabricator.wikimedia.org/T327587 [08:16:42] MatmaRex: on your change now.. [08:16:52] thanks [08:18:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883098 (https://phabricator.wikimedia.org/T323727) (owner: 10Bartosz Dziewoński) [08:18:24] !log Starting s4 codfw failover from db2140 to db2110 - T327739 [08:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:28] T327739: Switchover s4 master (db2140 -> db2110) - https://phabricator.wikimedia.org/T327739 [08:19:04] (03Merged) 10jenkins-bot: Add "Page Frame" to DiscussionTools beta feature on almost all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883098 (https://phabricator.wikimedia.org/T323727) (owner: 10Bartosz Dziewoński) [08:19:26] !log kartik@deploy1002 Started scap: Backport for [[gerrit:883098|Add "Page Frame" to DiscussionTools beta feature on almost all wikis (T323727)]] [08:19:28] Dreamy_Jazz: Did you find anyone with needed permission? I don't have it either :/ [08:19:30] T323727: [Config Change] Enable Page Frame as beta feature at Phase 1 wikis (desktop) - https://phabricator.wikimedia.org/T323727 [08:19:39] Hmm. Not yet. [08:19:42] zabe: [08:20:24] Urbanecm: [08:20:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2110 to s4 primary T327739', diff saved to https://phabricator.wikimedia.org/P43280 and previous config saved to /var/cache/conftool/dbconfig/20230124-082025-root.json [08:20:43] Zabe and Urbanecm were able to test it last time. 
[08:21:13] !log kartik@deploy1002 kartik and matmarex: Backport for [[gerrit:883098|Add "Page Frame" to DiscussionTools beta feature on almost all wikis (T323727)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [08:21:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2140 T327739', diff saved to https://phabricator.wikimedia.org/P43281 and previous config saved to /var/cache/conftool/dbconfig/20230124-082138-marostegui.json [08:22:03] MatmaRex: Please test on mwdebug2001/1002/1001/2002 [08:22:19] kart_: looks good [08:22:33] cool. Deploying now.. [08:23:13] I think any steward could help test this [08:24:33] i suppose i would have access with my WMF staff account, or at least i could grant myself access, but i'm not sure whether it's appropriate to use it for testing like this [08:24:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2110 from API T327739', diff saved to https://phabricator.wikimedia.org/P43282 and previous config saved to /var/cache/conftool/dbconfig/20230124-082440-marostegui.json [08:24:44] T327739: Switchover s4 master (db2140 -> db2110) - https://phabricator.wikimedia.org/T327739 [08:25:11] i've never used checkuser tools, except while testing patches locally [08:25:28] It would only be to load the page Special:CheckUserLog [08:25:58] But you have to have the checkuser-log right [08:26:12] I've asked in the #wikimedia-stewards channel for anyone that could help [08:26:40] i think i could do that if no one else is available [08:27:07] Okay. No response from a steward yet. You would only need the checkuser-log right. 
[08:27:13] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf,ops for eoghan - https://phabricator.wikimedia.org/T327743 (10Jelto) [08:27:50] (03PS1) 10Marostegui: Revert "db2140: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/882728 [08:28:35] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:883098|Add "Page Frame" to DiscussionTools beta feature on almost all wikis (T323727)]] (duration: 09m 09s) [08:28:39] T323727: [Config Change] Enable Page Frame as beta feature at Phase 1 wikis (desktop) - https://phabricator.wikimedia.org/T323727 [08:29:21] MatmaRex: Deployed. Will you test Dreamy_Jazz's config change? [08:30:10] thanks. yeah [08:30:30] Cool. Let me check the patch first.. :) [08:31:04] (03CR) 10Marostegui: [C: 03+2] Revert "db2140: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/882728 (owner: 10Marostegui) [08:31:12] (03PS3) 10KartikMistry: Enable write new for CheckUserLog comment fields on testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882240 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [08:32:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43283 and previous config saved to /var/cache/conftool/dbconfig/20230124-083200-root.json [08:32:05] Is there room to sneak in two clean mw-config patches? 
:-) [08:32:15] *clean-up [08:32:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882240 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [08:33:07] (03Merged) 10jenkins-bot: Enable write new for CheckUserLog comment fields on testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882240 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [08:33:15] WMDE-Fisch: Please add :) [08:33:31] !log kartik@deploy1002 Started scap: Backport for [[gerrit:882240|Enable write new for CheckUserLog comment fields on testwikis (T233004)]] [08:33:33] The test steps for my one would be to load Special:CheckUserLog, and filter for the text "Testing gerrit:879652, now with mwdebug enabled". It should show two entries if the change is correct. [08:33:34] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [08:33:59] kart_: Thanks, done! [08:34:34] !log phedenskog@deploy1002 Started deploy [performance/navtiming@8c87ca6]: (no justification provided) [08:34:40] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@8c87ca6]: (no justification provided) (duration: 00m 06s) [08:34:43] Can be merged in any order. They should have no effect and are long gone in the extension. 
[08:34:47] ok i remembered my password, now i need to give myself the rights [08:35:01] (03PS1) 10Ayounsi: Depool ulsfo for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/883106 (https://phabricator.wikimedia.org/T316532) [08:35:16] !log kartik@deploy1002 dreamyjazz and kartik: Backport for [[gerrit:882240|Enable write new for CheckUserLog comment fields on testwikis (T233004)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [08:35:28] MatmaRex: Please test Dreamy_Jazz's patch with the above test description :) [08:35:30] Do staff accounts have checkuser-log globally already? [08:35:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Primary switchover x1 T327745 [08:35:49] T327745: Switchover x1 codfw master db2096 -> db2115 - https://phabricator.wikimedia.org/T327745 [08:36:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Primary switchover x1 T327745 [08:36:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2115 with weight 0 T327745', diff saved to https://phabricator.wikimedia.org/P43284 and previous config saved to /var/cache/conftool/dbconfig/20230124-083643-marostegui.json [08:36:52] Actually, it should show one entry. I got a little confused. [08:37:09] So the correct steps this time :) [08:37:10] Dreamy_Jazz: by default staff accounts have nothing. but the "staff" global group grants rights to grant yourself other rights. 
and i just granted the rights to myself https://test.wikipedia.org/wiki/Special:UserRights/Bartosz_Dziewoński_(WMF) [08:37:45] (03PS1) 10Marostegui: mariadb: Promote db2115 to x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/883107 (https://phabricator.wikimedia.org/T327745) [08:37:47] Load Special:CheckUserLog on test.wikipedia.org, filter for the reason "Testing gerrit:879652, now with mwdebug enabled" and one entry should appear [08:37:50] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) [08:37:57] If no entries appear then the config change hasn't been made. [08:38:03] i'm looking at https://test.wikipedia.org/wiki/Special:CheckUserLog now. i see test entries from the last time you were testing config changes [08:38:17] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2115 to x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/883107 (https://phabricator.wikimedia.org/T327745) (owner: 10Marostegui) [08:39:02] !log Starting x1 codfw failover from db2096 to db2115 - T327745 [08:39:03] Dreamy_Jazz: i see an entry with that reason from 17 January, is that it? [08:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:38] (03CR) 10Hashar: opensearch: make upgrade-phatality.sh stricter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: 10Hashar) [08:39:41] It should be the entry referenced in https://phabricator.wikimedia.org/P43169 [08:39:55] So yes. [08:40:33] yes, the ID matches [08:40:36] (03PS5) 10KartikMistry: Deprecate the EnableMapFrame feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875463 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [08:40:39] i guess that works then [08:40:47] i was just a bit confused since it's a week old [08:40:57] Yeah. I selected it as I knew its reason [08:40:58] cool. 
Thanks MatmaRex. [08:41:19] But when read new is enabled, the searching mechanism for the reason field changes [08:41:44] So I used an entry I knew about as a method to test that the searching mechanism changed [08:41:45] oh, i see [08:41:55] yeah, it doesn't appear when i'm not on mwdebug [08:42:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2115 to x1 codfw T327745', diff saved to https://phabricator.wikimedia.org/P43285 and previous config saved to /var/cache/conftool/dbconfig/20230124-084206-marostegui.json [08:42:10] T327745: Switchover x1 codfw master db2096 -> db2115 - https://phabricator.wikimedia.org/T327745 [08:42:18] Cool. That works then. Thanks :) [08:42:27] :) [08:42:45] I'll go ahead with the deployment now. Good testing! :) [08:44:27] WMDE-Fisch: Deploying your first patch in a few minutes.. [08:44:33] +1 [08:45:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2096 T327745', diff saved to https://phabricator.wikimedia.org/P43286 and previous config saved to /var/cache/conftool/dbconfig/20230124-084508-marostegui.json [08:45:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add some weight to db2115 in x1 codfw', diff saved to https://phabricator.wikimedia.org/P43287 and previous config saved to /var/cache/conftool/dbconfig/20230124-084552-marostegui.json [08:47:00] (03PS1) 10Marostegui: db2096: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883110 (https://phabricator.wikimedia.org/T327745) [08:47:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43288 and previous config saved to /var/cache/conftool/dbconfig/20230124-084705-root.json [08:47:40] (03CR) 10Marostegui: [C: 03+2] db2096: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883110 (https://phabricator.wikimedia.org/T327745) (owner: 10Marostegui) [08:47:51] (03CR) 10DCausse: [C: 03+1] dse-k8s: add rdf-streaming-update-ng 
namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [08:48:52] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:882240|Enable write new for CheckUserLog comment fields on testwikis (T233004)]] (duration: 15m 20s) [08:48:55] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [08:49:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875463 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [08:49:15] Thanks! [08:49:55] (03Merged) 10jenkins-bot: Deprecate the EnableMapFrame feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875463 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [08:50:19] !log kartik@deploy1002 Started scap: Backport for [[gerrit:875463|Deprecate the EnableMapFrame feature flag (T326288)]] [08:50:22] T326288: Deprecate some Kartographer feature flags - https://phabricator.wikimedia.org/T326288 [08:50:53] (03CR) 10Jelto: [C: 03+2] gitlab: exclude shell scripts and other backups from rsync jobs [puppet] - 10https://gerrit.wikimedia.org/r/882704 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [08:52:03] !log kartik@deploy1002 awight and kartik: Backport for [[gerrit:875463|Deprecate the EnableMapFrame feature flag (T326288)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:52:41] (03PS1) 10Marostegui: Revert "db2096: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/882730 [08:52:46] WMDE-Fisch: Please test first patch on mwdebug1001/2001/1002/2002.. 
[08:54:04] kart_: All good, go on :-) [08:54:24] (03CR) 10Marostegui: [C: 03+2] Revert "db2096: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/882730 (owner: 10Marostegui) [08:55:01] (03PS1) 10Slyngshede: LDAP property editor [software/bitu] - 10https://gerrit.wikimedia.org/r/883111 [08:55:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43289 and previous config saved to /var/cache/conftool/dbconfig/20230124-085501-root.json [08:55:18] Cool. Deploying. [08:56:14] FYI these flags have no meaning anymore. I just run 1-2 sanity checks. But in general we should be very fine. [08:56:54] Noted. Thanks. Good to test while deploying though :) [08:57:05] Yep [08:57:39] (03PS1) 10Marostegui: db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883112 [08:58:02] (03CR) 10Marostegui: [C: 03+2] db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883112 (owner: 10Marostegui) [09:00:04] (03CR) 10JMeybohm: dse-k8s: add rdf-streaming-update-ng namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [09:01:01] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:875463|Deprecate the EnableMapFrame feature flag (T326288)]] (duration: 10m 42s) [09:01:05] T326288: Deprecate some Kartographer feature flags - https://phabricator.wikimedia.org/T326288 [09:01:39] (03PS1) 10Slyngshede: Switch to built in LogoutView. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/883113 [09:02:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43290 and previous config saved to /var/cache/conftool/dbconfig/20230124-090210-root.json [09:02:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878853 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [09:02:33] Going with the last patch.. [09:02:36] 10SRE, 10Icinga, 10SRE Observability, 10serviceops: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (10Joe) p:05Triage→03Medium a:03Joe [09:02:47] (I know we're overtime :/) [09:03:06] (03Merged) 10jenkins-bot: Remove Kartographer versioned mapdata flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878853 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [09:03:36] !log kartik@deploy1002 Started scap: Backport for [[gerrit:878853|Remove Kartographer versioned mapdata flags (T326288)]] [09:05:06] (03PS1) 10Marostegui: Revert "db2093: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/882731 [09:05:22] !log kartik@deploy1002 awight and kartik: Backport for [[gerrit:878853|Remove Kartographer versioned mapdata flags (T326288)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [09:05:31] WMDE-Fisch: quick testing of the second patch please :) [09:05:33] (03CR) 10Marostegui: [C: 03+2] Revert "db2093: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/882731 (owner: 10Marostegui) [09:05:46] kart_: Sure! [09:06:42] (03PS1) 10EoghanGaffney: Add account for eoghan [puppet] - 10https://gerrit.wikimedia.org/r/883114 [09:07:02] kart_: All fine! 
[09:07:31] cool [09:10:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43291 and previous config saved to /var/cache/conftool/dbconfig/20230124-091006-root.json [09:13:20] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:878853|Remove Kartographer versioned mapdata flags (T326288)]] (duration: 09m 44s) [09:13:24] T326288: Deprecate some Kartographer feature flags - https://phabricator.wikimedia.org/T326288 [09:14:15] !log Done: UTC morning backport window [09:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:24] WMDE-Fisch: done. [09:14:32] Thanks! [09:16:03] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:16:03] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:17:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43292 and previous config saved to /var/cache/conftool/dbconfig/20230124-091715-root.json [09:17:25] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:17:25] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43293 and previous config saved to /var/cache/conftool/dbconfig/20230124-092511-root.json [09:26:52] (03CR) 10Clément Goubert: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: 10Dzahn) [09:28:54] (03PS1) 10Muehlenhoff: Remove ldap-corp Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/883116 [09:30:15] (03PS1) 10Elukey: services: change liftwing's test rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/883117 (https://phabricator.wikimedia.org/T327302) [09:31:56] (03CR) 10Muehlenhoff: [C: 03+2] Remove ldap-corp Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/883116 (owner: 10Muehlenhoff) [09:32:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43294 and previous config saved to /var/cache/conftool/dbconfig/20230124-093220-root.json [09:36:04] (03PS1) 10Marostegui: db1115: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883119 [09:36:32] (03CR) 10Marostegui: [C: 03+2] db1115: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883119 (owner: 10Marostegui) [09:37:03] (03CR) 10Elukey: [C: 03+2] services: change liftwing's test rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/883117 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [09:39:26] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [09:39:28] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2041.codfw.wmnet with OS bullseye [09:39:37] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [09:40:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43295 and previous config saved to /var/cache/conftool/dbconfig/20230124-094016-root.json [09:41:39] !log installing libtasn1-6 security updates on buster [09:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:04] (03CR) 10Filippo Giunchedi: "siiigh I 
thought I sent the comment, and didn't!" [puppet] - 10https://gerrit.wikimedia.org/r/881602 (owner: 10Majavah) [09:43:12] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:43:26] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:44:20] (03PS1) 10Marostegui: Revert "db1115: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/882735 [09:44:46] (03CR) 10Marostegui: [C: 03+2] Revert "db1115: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/882735 (owner: 10Marostegui) [09:47:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43296 and previous config saved to /var/cache/conftool/dbconfig/20230124-094725-root.json [09:50:48] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/882772 (https://phabricator.wikimedia.org/T327754) [09:52:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 35 hosts with reason: Primary switchover s8 T327754 [09:52:24] T327754: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T327754 [09:52:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2161 with weight 0 T327754', diff saved to https://phabricator.wikimedia.org/P43297 and previous config saved to /var/cache/conftool/dbconfig/20230124-095235-marostegui.json [09:52:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 35 hosts with reason: Primary switchover s8 T327754 [09:53:46] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2161 
to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/882772 (https://phabricator.wikimedia.org/T327754) (owner: 10Gerrit maintenance bot) [09:55:20] (03PS1) 10Marostegui: db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883121 (https://phabricator.wikimedia.org/T327754) [09:55:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43298 and previous config saved to /var/cache/conftool/dbconfig/20230124-095520-root.json [09:55:36] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2041.codfw.wmnet with reason: host reimage [09:55:40] (03CR) 10Marostegui: [C: 03+2] db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883121 (https://phabricator.wikimedia.org/T327754) (owner: 10Marostegui) [09:57:02] (03PS1) 10Dreamy Jazz: Enable write new for CheckUserLog comment fields on group 0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883122 (https://phabricator.wikimedia.org/T233004) [09:58:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2041.codfw.wmnet with reason: host reimage [09:58:55] (03PS7) 10Jaime Nuche: jenkins: add remaining config for Scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/860837 (https://phabricator.wikimedia.org/T323909) [09:58:56] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [09:59:25] (03CR) 10Jaime Nuche: jenkins: add remaining config for Scap3 deployment (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/860837 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [09:59:49] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:59:51] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: apply on main [10:00:47] (03PS2) 
10EoghanGaffney: Add account for eoghan [puppet] - 10https://gerrit.wikimedia.org/r/883114 [10:03:28] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/860837/39216/" [puppet] - 10https://gerrit.wikimedia.org/r/860837 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [10:10:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2096 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43299 and previous config saved to /var/cache/conftool/dbconfig/20230124-101025-root.json [10:12:30] (03PS3) 10EoghanGaffney: Add production ssh account for eoghan [puppet] - 10https://gerrit.wikimedia.org/r/883114 [10:13:52] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2041.codfw.wmnet with OS bullseye [10:14:00] !log Starting s8 codfw failover from db2165 to db2161 - T327754 [10:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:04] T327754: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T327754 [10:14:50] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:15:02] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:16:54] (03CR) 10Jelto: [C: 03+1] "verified SSH key over second channel" [puppet] - 10https://gerrit.wikimedia.org/r/883114 (owner: 10EoghanGaffney) [10:17:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2161 to s8 primary T327754', diff saved to https://phabricator.wikimedia.org/P43300 and previous config saved to /var/cache/conftool/dbconfig/20230124-101727-root.json [10:17:47] !log depooling maps from equad && pooling maps on codfw [10:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:49] (03CR) 10Jelto: [C: 03+2] Add production 
ssh account for eoghan [puppet] - 10https://gerrit.wikimedia.org/r/883114 (owner: 10EoghanGaffney) [10:17:59] eqiad, damn:p [10:18:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2165 T327754', diff saved to https://phabricator.wikimedia.org/P43301 and previous config saved to /var/cache/conftool/dbconfig/20230124-101825-root.json [10:19:33] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [10:19:46] !log rolling Apache/FPM restarts on mw canaries to pick up libtasn security update [10:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:10] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:22:47] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:22:56] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [10:25:04] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:25:32] (03PS1) 10Marostegui: Revert "db2165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/882738 [10:25:50] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf,ops for eoghan - https://phabricator.wikimedia.org/T327743 (10Jelto) p:05Triage→03Medium [10:27:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43302 and previous config saved to /var/cache/conftool/dbconfig/20230124-102730-root.json [10:27:32] (03CR) 10Marostegui: [C: 03+2] Revert "db2165: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/882738 (owner: 10Marostegui) [10:28:22] PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:28:44] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf,ops for eoghan - https://phabricator.wikimedia.org/T327743 (10Jelto) 05Open→03Resolved I added user `eoghan` to ldap group `wmf` and `ops`. User is present in admin/data.yaml too: https://gerrit.wikimedia.org/r/q/883114 [10:28:50] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:28] !log depool cp4046 [10:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:31] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:03] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:30:07] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:54] (03PS1) 10EoghanGaffney: Add eoghan shell account to ops group [puppet] - 10https://gerrit.wikimedia.org/r/883125 [10:31:04] !log restarting varnish on cp4046 [10:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:21] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:48] (03CR) 10Jelto: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/883125 (owner: 
10EoghanGaffney) [10:32:58] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:33:31] !log repool cp4046 [10:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:35] (03CR) 10EoghanGaffney: [C: 03+2] Add eoghan shell account to ops group [puppet] - 10https://gerrit.wikimedia.org/r/883125 (owner: 10EoghanGaffney) [10:35:21] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:13] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:35] (03PS1) 10Elukey: changeprop: update liftwing's rule templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/883128 (https://phabricator.wikimedia.org/T327302) [10:37:47] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:38:25] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:01] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:33] 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10Vgutierrez) cp4046 has been impacted by the same issue a few minutes ago [10:40:50] (03PS1) 10Marostegui: instances.yaml: Re-add db1106, remove db1176 [puppet] - 
10https://gerrit.wikimedia.org/r/883129 (https://phabricator.wikimedia.org/T326116) [10:41:14] (03PS2) 10Elukey: changeprop: update liftwing's rule templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/883128 (https://phabricator.wikimedia.org/T327302) [10:41:38] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Re-add db1106, remove db1176 [puppet] - 10https://gerrit.wikimedia.org/r/883129 (https://phabricator.wikimedia.org/T326116) (owner: 10Marostegui) [10:42:13] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1176 from s1 T326116', diff saved to https://phabricator.wikimedia.org/P43303 and previous config saved to /var/cache/conftool/dbconfig/20230124-104219-root.json [10:42:22] (03PS1) 10JMeybohm: Add prometheus user to the system:monitoring group [labs/private] - 10https://gerrit.wikimedia.org/r/883130 (https://phabricator.wikimedia.org/T307943) [10:42:24] T326116: Package and test MariaDB 11 - https://phabricator.wikimedia.org/T326116 [10:42:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43304 and previous config saved to /var/cache/conftool/dbconfig/20230124-104235-root.json [10:42:37] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:39] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus 
[10:42:56] uh, nice to see that work :D [10:43:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1106 to dbctl in s1 T326116', diff saved to https://phabricator.wikimedia.org/P43305 and previous config saved to /var/cache/conftool/dbconfig/20230124-104336-marostegui.json [10:44:46] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add prometheus user to the system:monitoring group [labs/private] - 10https://gerrit.wikimedia.org/r/883130 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:45:00] (03PS1) 10Marostegui: mariadb: Install MariaDB 11 on db1106 [puppet] - 10https://gerrit.wikimedia.org/r/883133 (https://phabricator.wikimedia.org/T326116) [10:46:39] (03PS2) 10Marostegui: mariadb: Install MariaDB 11 on db1106 [puppet] - 10https://gerrit.wikimedia.org/r/883133 (https://phabricator.wikimedia.org/T326116) [10:46:53] (03CR) 10Vlad.shapik: Add a longer list of thumbor local configs and fix make online-test command (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/881909 (https://phabricator.wikimedia.org/T325811) (owner: 10Vlad.shapik) [10:47:10] (03CR) 10Ayounsi: [C: 03+2] Depool ulsfo for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/883106 (https://phabricator.wikimedia.org/T316532) (owner: 10Ayounsi) [10:48:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Install MariaDB 11 on db1106 [puppet] - 10https://gerrit.wikimedia.org/r/883133 (https://phabricator.wikimedia.org/T326116) (owner: 10Marostegui) [10:49:01] 10SRE, 10cloud-services-team: Fix all .erb variable warnings - https://phabricator.wikimedia.org/T97251 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert Being bold and closing this task since it's had no meaningful update in 7 years. Feel free to reopen if needed. 
[10:49:43] !log depool ulsfo for network maintenance - T316532 [10:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:47] T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 [10:49:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] thumbor: add and use haproxy healthz lvs check [puppet] - 10https://gerrit.wikimedia.org/r/880898 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [10:50:01] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:02] we have done the maps alerts [10:51:09] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:52] (03CR) 10Hnowlan: [C: 03+1] "lgtm - as noted on irc, removing cases will probably not change behaviour but this is the correct formatting for a single case" [deployment-charts] - 10https://gerrit.wikimedia.org/r/883128 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [10:52:07] (03PS1) 10Elukey: services: increase verbosity of changeprop logs in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883135 (https://phabricator.wikimedia.org/T327302) [10:52:19] (03PS1) 10Marostegui: mariadb: Move db1176 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/883136 (https://phabricator.wikimedia.org/T327762) [10:52:34] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [10:52:45] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1176 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/883136 (https://phabricator.wikimedia.org/T327762) (owner: 10Marostegui) [10:53:40] (03CR) 10Hnowlan: [C: 03+1] services: increase verbosity of changeprop logs in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883135 
(https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [10:54:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1176.eqiad.wmnet with OS bullseye [10:54:20] (03CR) 10Elukey: [C: 03+2] changeprop: update liftwing's rule templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/883128 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [10:55:28] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:55:41] 10SRE, 10Traffic, 10Patch-For-Review, 10Upstream: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10Vgutierrez) both cp2041 and cp2042 look good to me. I haven't found any reason that would prevent upgrading to bullseye [10:56:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:39] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:57:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43306 and previous config saved to /var/cache/conftool/dbconfig/20230124-105740-root.json [10:58:15] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [10:58:26] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [10:58:28] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:59:10] 
RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:23] (03CR) 10Hnowlan: [C: 03+2] Add a longer list of thumbor local configs and fix make online-test command [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/881909 (https://phabricator.wikimedia.org/T325811) (owner: 10Vlad.shapik) [10:59:50] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:50] !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner [11:00:00] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T1100) [11:00:51] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:01:04] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [11:01:06] (03CR) 10Elukey: [C: 03+2] services: increase verbosity of changeprop logs in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883135 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [11:02:25] !log depooling maps (kartotherian) from codfw, leaving eqiad as pooled [11:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:10] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [11:03:23] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [11:03:31] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [11:03:36] RECOVERY - Check systemd 
state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:34] 10SRE-Access-Requests: Security Issue Access Request for pfischer - https://phabricator.wikimedia.org/T327765 (10pfischer) [11:05:00] 10SRE-Access-Requests: Procurement Tickets Access Request for pfischer - https://phabricator.wikimedia.org/T327765 (10Aklapper) [11:05:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage [11:06:25] dbproxy alerts are to be expected on irc [11:06:36] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:06:48] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:06:59] (03Merged) 10jenkins-bot: Add a longer list of thumbor local configs and fix make online-test command [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/881909 (https://phabricator.wikimedia.org/T325811) (owner: 10Vlad.shapik) [11:08:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage [11:08:35] 10SRE-Access-Requests: Procurement Tickets Access Request for pfischer - https://phabricator.wikimedia.org/T327765 (10MoritzMuehlenhoff) Hi Peter, we don't generally use Phabricator tasks for this, you can instead simply mail security-help@wikimedia.org, they'll add you. 
[11:09:00] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:09:39] (03CR) 10Jbond: [C: 03+1] rspamd: vendor github.com/oxc/puppet-rspamd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397) (owner: 10JHathaway) [11:09:42] 10SRE-Access-Requests: Procurement Tickets Access Request for pfischer - https://phabricator.wikimedia.org/T327765 (10Aklapper) @MoritzMuehlenhoff This is about S4 and not security issues [11:10:05] 10SRE-Access-Requests: Procurement Tickets Access Request for pfischer - https://phabricator.wikimedia.org/T327765 (10MoritzMuehlenhoff) >>! In T327765#8552827, @MoritzMuehlenhoff wrote: > Hi Peter, we don't generally use Phabricator tasks for this, you can instead simply mail security-help@wikimedia.org, they'l... [11:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [11:11:03] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:11:27] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:12:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/880889 (https://phabricator.wikimedia.org/T327095) (owner: 10Ayounsi) [11:12:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43308 and previous config saved to /var/cache/conftool/dbconfig/20230124-111245-root.json [11:13:46] jouncebot, nowandnext [11:13:47] For the next 0 hour(s) and 46 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T1100) [11:13:47] In 2 hour(s) and 46 
minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T1400) [11:13:47] In 2 hour(s) and 46 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T1400) [11:14:10] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:50] (03CR) 10Zabe: [C: 03+2] Stop loading PoolCounter extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881467 (https://phabricator.wikimedia.org/T327336) (owner: 10Zabe) [11:16:31] (03Merged) 10jenkins-bot: Stop loading PoolCounter extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881467 (https://phabricator.wikimedia.org/T327336) (owner: 10Zabe) [11:17:08] (03PS1) 10Muehlenhoff: Remove old ping hosts [puppet] - 10https://gerrit.wikimedia.org/r/883137 (https://phabricator.wikimedia.org/T273509) [11:17:33] !log zabe@deploy1002 Started scap: Backport for [[gerrit:881467|Stop loading PoolCounter extension (T327336)]] [11:17:37] T327336: Undeploy PoolCounter extension from wmf production - https://phabricator.wikimedia.org/T327336 [11:19:14] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:19:19] !log zabe@deploy1002 zabe: Backport for [[gerrit:881467|Stop loading PoolCounter extension (T327336)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [11:21:25] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/883100 (owner: 10Elukey) [11:21:25] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ping3002.esams.wmnet [11:21:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:22:01] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:22:20] (03CR) 10Elukey: [C: 03+2] role::kafka::jumbo::broker: update firewall rules for centrallog1001 [puppet] - 10https://gerrit.wikimedia.org/r/883100 (owner: 10Elukey) [11:23:39] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:25:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1176.eqiad.wmnet with OS bullseye [11:26:53] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:881467|Stop loading PoolCounter extension (T327336)]] (duration: 09m 19s) [11:26:56] T327336: Undeploy PoolCounter extension from wmf production - https://phabricator.wikimedia.org/T327336 [11:27:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43310 and previous config saved to /var/cache/conftool/dbconfig/20230124-112750-root.json [11:28:02] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:07] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:29:35] (03PS1) 10Marostegui: site.pp: Add db1106 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/883140 [11:29:37] (03PS1) 10Marostegui: site.pp: Add db1206 as sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/883141 
(https://phabricator.wikimedia.org/T326669) [11:29:57] (03CR) 10Jbond: puppetmaster: add prometheus metrics for cert expiration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881602 (owner: 10Majavah) [11:30:10] (03CR) 10Marostegui: [C: 03+2] site.pp: Add db1206 as sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/883141 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [11:30:18] (03CR) 10Marostegui: [C: 03+2] site.pp: Add db1106 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/883140 (owner: 10Marostegui) [11:33:17] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:33:51] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:34:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:34:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:35:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping3002.esams.wmnet [11:35:08] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: upgrade ping offload servers to bullseye (was: ping servers running out of disk) - https://phabricator.wikimedia.org/T273509 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ping3002.esams.wmnet` - ping3002.... 
[11:35:48] 10SRE, 10DBA, 10Data-Persistence, 10Data-Persistence-Backup, and 2 others: Data check es2020 after replication broke - https://phabricator.wikimedia.org/T327770 (10jcrespo) [11:36:30] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:36:36] 10SRE, 10DBA, 10Data-Persistence, 10Data-Persistence-Backup, and 2 others: Data check es2020 after replication broke - https://phabricator.wikimedia.org/T327770 (10jcrespo) It's a single command+wait so not much overhead, I prefer to mostly block it for now until backup finishes. [11:36:47] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner [11:38:54] (03CR) 10Jbond: Add production ssh account for eoghan (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883114 (owner: 10EoghanGaffney) [11:39:05] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:39:10] 10SRE, 10DBA, 10Data-Persistence, 10Data-Persistence-Backup, and 2 others: Data check es2020 after replication broke - https://phabricator.wikimedia.org/T327770 (10Marostegui) Sounds good, let me know if you need me :) [11:42:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43311 and previous config saved to /var/cache/conftool/dbconfig/20230124-114255-root.json [11:44:20] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:37] 10SRE, 10SRE-Access-Requests: Requesting access to WMF Production for Kavitha Appakayala - https://phabricator.wikimedia.org/T327450 (10akosiaris) 05In progress→03Invalid I 've synced up with Kavitha as her onboarding buddy (this ticket is actually part of the onboarding process for SRE). We will be follow... 
[11:49:16] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:50:30] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:52:25] (03PS7) 10Clément Goubert: mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 [11:54:07] !log uploaded python3-gjson_1.0.0 to apt.wikimedia.org bullseye-wikimedia,unstable-wikimedia [11:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:58] (03PS1) 10Urbanecm: Remove GEMentorProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883153 (https://phabricator.wikimedia.org/T321501) [11:55:28] (03CR) 10Urbanecm: [C: 04-2] "pending full deployment of wmf.20" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883153 (https://phabricator.wikimedia.org/T321501) (owner: 10Urbanecm) [11:56:08] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] thumbor: add and use haproxy healthz lvs check [puppet] - 10https://gerrit.wikimedia.org/r/880898 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:56:32] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/881365 (owner: 10Slyngshede) [11:57:16] (03CR) 10Jbond: [C: 03+1] "we should be able to progress with this now" [puppet] - 10https://gerrit.wikimedia.org/r/860551 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [11:58:00] 10SRE-swift-storage, 10Wikimedia-production-error: FileBackendError: Iterator page I/O error. - https://phabricator.wikimedia.org/T327681 (10MatthewVernon) It's really not very straightforward to trace this request into what went on in swift. Logstash shows me an error timestamped at `2023-01-23T18:46:27+00:00... 
[12:04:12] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:04:32] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:07:43] (03PS1) 10Majavah: P:ssh::server: fix stretch compat [puppet] - 10https://gerrit.wikimedia.org/r/883156 [12:10:55] (03CR) 10Zabe: [C: 03+2] Remove PoolCounter from extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881468 (https://phabricator.wikimedia.org/T327336) (owner: 10Zabe) [12:11:52] (03Merged) 10jenkins-bot: Remove PoolCounter from extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881468 (https://phabricator.wikimedia.org/T327336) (owner: 10Zabe) [12:12:40] !log zabe@deploy1002 Started scap: Backport for [[gerrit:881468|Remove PoolCounter from extension-list (T327336)]] [12:12:44] T327336: Undeploy PoolCounter extension from wmf production - https://phabricator.wikimedia.org/T327336 [12:13:28] (03CR) 10FNegri: [C: 03+2] "Thanks, merging!" [puppet] - 10https://gerrit.wikimedia.org/r/883156 (owner: 10Majavah) [12:15:25] (03CR) 10Hnowlan: [C: 03+2] thumbor: add failure condition to health check [deployment-charts] - 10https://gerrit.wikimedia.org/r/881635 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:19:37] 10SRE, 10SRE-Access-Requests: Procurement Tickets Access Request for pfischer - https://phabricator.wikimedia.org/T327765 (10Clement_Goubert) According to #acl*procurement-review access needs to be granted by @mark or @RobH [12:20:29] (03Merged) 10jenkins-bot: thumbor: add failure condition to health check [deployment-charts] - 10https://gerrit.wikimedia.org/r/881635 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:20:51] (03CR) 10Clément Goubert: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/883151 (https://phabricator.wikimedia.org/T239862) (owner: 10Clément Goubert) [12:21:04] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thumbor2004.codfw.wmnet [12:32:34] 10SRE-swift-storage, 10Wikimedia-production-error: High rate of upload 502 errors/timeouts (was FileBackendError: Iterator page I/O error.) - https://phabricator.wikimedia.org/T327681 (10jcrespo) [12:32:44] 10SRE-swift-storage, 10Wikimedia-production-error: High rate of upload 502 errors/timeouts (was FileBackendError: Iterator page I/O error.) - https://phabricator.wikimedia.org/T327681 (10jcrespo) p:05Triage→03High [12:35:53] (03CR) 10Filippo Giunchedi: puppetmaster: add prometheus metrics for cert expiration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881602 (owner: 10Majavah) [12:36:01] 10SRE-swift-storage, 10Wikimedia-production-error: High rate of upload 502 errors/timeouts (was FileBackendError: Iterator page I/O error.) - https://phabricator.wikimedia.org/T327681 (10jcrespo) Some worrying pattern of errors was discovered at: * https://grafana.wikimedia.org/goto/AuNJBfTVz?orgId=1 * https:... 
[12:38:26] !log zabe@deploy1002 zabe: Backport for [[gerrit:881468|Remove PoolCounter from extension-list (T327336)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [12:38:31] T327336: Undeploy PoolCounter extension from wmf production - https://phabricator.wikimedia.org/T327336 [12:38:34] (03CR) 10Clément Goubert: mediawiki: Update ecs logging to 1.11.0 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 (owner: 10Clément Goubert) [12:40:21] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-proxies rolling restart_daemons on A:eqiad and A:swift-fe or A:thanos-fe [12:40:33] 10SRE, 10SRE-Access-Requests: Requesting access to Jupyter for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Samwalton9) [12:41:03] 10SRE, 10SRE-Access-Requests: Requesting access to Jupyter for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Samwalton9) [12:41:39] 10SRE, 10SRE-Access-Requests: Requesting access to Jupyter for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Samwalton9) Do I need approval again? I got it already in T277298 for the private data access. [12:43:52] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 36 hosts with reason: nework maintenance [12:44:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 36 hosts with reason: nework maintenance [12:44:35] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=795679e1-6c07-4196-8280-0cef7454587d) set by ayounsi@cumin1001 fo... 
[12:48:01] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Ollie_Shotton - https://phabricator.wikimedia.org/T327187 (10Ollie.Shotton_WMDE) Successfully SSHed in and reset Kerberos password. Thanks! [12:48:37] 10SRE, 10SRE-Access-Requests: Requesting access to Jupyter for Sam Walton - https://phabricator.wikimedia.org/T327780 (10jcrespo) @Samwalton9 Not the person that will manage your request, but access additions after the first access was granted is a simplified process- probably just approval from the team respo... [12:48:49] !log restart ulsfo switches for network maintenance [12:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:25] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] D:apereo_cas::service Allow OIDC to define release policies [puppet] - 10https://gerrit.wikimedia.org/r/881365 (owner: 10Slyngshede) [12:49:34] (03CR) 10Jbond: puppetmaster: add prometheus metrics for cert expiration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881602 (owner: 10Majavah) [12:50:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-proxies (exit_code=0) rolling restart_daemons on A:eqiad and A:swift-fe or A:thanos-fe [12:51:41] (03PS1) 10Filippo Giunchedi: hieradata: enable ssh key ldap lookup in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/883162 [12:51:48] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [12:51:51] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [12:53:04] 10SRE-tools, 10Infrastructure-Foundations: sre.swift.roll-restart-reboot-proxies fails on thanos hosts, which lack nginx - https://phabricator.wikimedia.org/T327783 (10MatthewVernon) [12:53:54] PROBLEM - VRRP status on cr3-ulsfo is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [12:54:16] PROBLEM - BFD status on cr4-ulsfo is 
CRITICAL: CRIT: Down: 10 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:54:28] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect [12:54:28] , AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:54:54] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 10 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:55:02] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - [12:55:02] , AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:55:06] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:55:22] 10SRE, 10SRE-Access-Requests: Requesting access to Jupyter for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Clement_Goubert) a:03Clement_Goubert [12:56:09] (03CR) 10Muehlenhoff: "Ack, I'll merge this later the day." 
[puppet] - 10https://gerrit.wikimedia.org/r/860551 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [12:56:32] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_ulsfo_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/881602 (owner: 10Majavah) [12:56:46] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:56:49] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: enable ssh key ldap lookup in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/883162 (owner: 10Filippo Giunchedi) [12:56:50] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:881468|Remove PoolCounter from extension-list (T327336)]] (duration: 44m 09s) [12:56:54] T327336: Undeploy PoolCounter extension from wmf production - https://phabricator.wikimedia.org/T327336 [12:57:08] RECOVERY - VRRP status on cr3-ulsfo is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [12:58:04] 10SRE, 10SRE-Access-Requests: Procurement Tickets Access Request for pfischer - https://phabricator.wikimedia.org/T327765 (10Gehel) As @pfischer's manager, I approve (if that's needed). [12:59:08] 10SRE-swift-storage, 10Wikimedia-production-error: High rate of upload 502 errors/timeouts (was FileBackendError: Iterator page I/O error.) - https://phabricator.wikimedia.org/T327681 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon I did a rolling-restart of the eqiad swift frontends, which looks t... 
[12:59:20] PROBLEM - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:00:13] 10SRE-tools, 10Infrastructure-Foundations: sre.swift.roll-restart-reboot-proxies fails on thanos hosts, which lack nginx - https://phabricator.wikimedia.org/T327783 (10MoritzMuehlenhoff) We ran into a similar issue in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/875469, I'm inclined to simply respin... [13:03:30] 10SRE, 10SRE-Access-Requests: Procurement Tickets Access Request for pfischer - https://phabricator.wikimedia.org/T327765 (10MoritzMuehlenhoff) a:03RobH [13:03:32] (Device rebooted) firing: Alert for device asw2-ulsfo.mgmt.ulsfo.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [13:04:32] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ping2002.codfw.wmnet [13:07:41] 10SRE, 10SRE-Access-Requests: Requesting access to Jupyter for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Clement_Goubert) [13:07:48] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:32] (Device rebooted) resolved: Device asw2-ulsfo.mgmt.ulsfo.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [13:08:34] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:09:56] RECOVERY - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:10:46] !log enabling tunnel services on cr2-eqdfw fpc 0 pic 1 [13:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:22] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main 
[13:11:51] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:14:14] 10SRE, 10SRE-Access-Requests: Requesting access to Jupyter for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Clement_Goubert) Hi @Samwalton9, I need : [] Approval from @Ottomata or @odimitrijevic for the privilege extension to shell access, as group approvers [] Approval from @DannyH for the same,... [13:14:19] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:16:23] (03PS1) 10Elukey: profile::pki::root_ca: add new intermediates for liftwing [puppet] - 10https://gerrit.wikimedia.org/r/883167 (https://phabricator.wikimedia.org/T327767) [13:18:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:18:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:18:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping2002.codfw.wmnet [13:18:13] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: upgrade ping offload servers to bullseye (was: ping servers running out of disk) - https://phabricator.wikimedia.org/T273509 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ping2002.codfw.wmnet` - ping2002.... 
[13:18:27] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ping1002.eqiad.wmnet [13:19:54] (ProbeDown) firing: (8) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:19:54] (ProbeDown) firing: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:20:08] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:20:16] <_joe_> XioNoX: I assume that's you right? [13:20:32] _joe_: not sure, what's the issue? [13:20:40] it's plausible though [13:20:47] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:20:52] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 89, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:20:52] <_joe_> XioNoX: ulsfo probes [13:20:58] <_joe_> specifically ncredir [13:21:00] _joe_: ulsfo, then yep [13:21:01] <_joe_> !incidents [13:21:14] _joe_: not sure how to downtime this specifically, I tried to downtime everything I could [13:21:18] ulsfo is depooled [13:21:32] <_joe_> ok thanks [13:21:46] <_joe_> slyngs: around? 
[13:21:50] Yes [13:21:50] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 103, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:21:52] 10SRE, 10SRE-Access-Requests: Requesting access to Jupyter for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium [13:22:47] (JobUnavailable) firing: (24) Reduced availability for job bird in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:24:23] <_joe_> !incidents [13:24:24] 3271 (ACKED) [FIRING:14] ProbeDown (probes/service ops page ulsfo prometheus sre) [13:24:24] 3267 (RESOLVED) db1170 (paged)/MariaDB Replica SQL: s7 (paged) [13:24:24] 3268 (RESOLVED) db1170 (paged)/MariaDB Replica Lag: s7 (paged) [13:24:24] 3266 (RESOLVED) db1105 (paged)/MariaDB Replica SQL: s2 (paged) [13:24:54] (ProbeDown) resolved: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:24:54] (ProbeDown) resolved: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:25:14] <_joe_> ok... [13:25:19] someone knows how I can downtime that alert for ulsfo only? [13:25:30] <_joe_> XioNoX: no idea sorry [13:25:38] godog maybe? 
[13:26:45] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:27:28] 10SRE, 10SRE-Access-Requests: Requesting shell access to analytics-privatedata for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Clement_Goubert) [13:27:47] I created silence a105cf3d-a17d-4655-928c-ea91339195a1 manually, I think it will do that [13:28:05] however the "view silence" link points to "http://localhost:9093/#/silences/a105cf3d-a17d-4655-928c-ea91339195a1" [13:29:21] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:30:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:30:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:30:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping1002.eqiad.wmnet [13:31:05] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: upgrade ping offload servers to bullseye (was: ping servers running out of disk) - https://phabricator.wikimedia.org/T273509 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ping1002.eqiad.wmnet` - ping1002.... 
[13:32:11] (03CR) 10Hashar: "I have inspected the result Puppet catalogue and it looks good to me ;)" [puppet] - 10https://gerrit.wikimedia.org/r/860837 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:32:13] (03PS2) 10Muehlenhoff: Remove old ping hosts [puppet] - 10https://gerrit.wikimedia.org/r/883137 (https://phabricator.wikimedia.org/T273509) [13:32:19] (03CR) 10Hashar: [C: 03+1] jenkins: add remaining config for Scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/860837 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:32:54] (03CR) 10Muehlenhoff: [C: 03+2] Remove old ping hosts [puppet] - 10https://gerrit.wikimedia.org/r/883137 (https://phabricator.wikimedia.org/T273509) (owner: 10Muehlenhoff) [13:34:52] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:36:00] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:50] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: upgrade ping offload servers to bullseye (was: ping servers running out of disk) - https://phabricator.wikimedia.org/T273509 (10MoritzMuehlenhoff) 05Open→03Resolved New ping1003/ping2003/ping3003 Bullseye VMs with 10G disk space have been created a... 
[13:37:30] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:38:27] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:38:38] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:40:00] (03PS1) 10Cathal Mooney: Add OSPF adjcaency over GRE from cr2-eqsin to cr2-eqdfw [homer/public] - 10https://gerrit.wikimedia.org/r/883170 (https://phabricator.wikimedia.org/T327265) [13:40:08] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:42:01] (03PS1) 10Ilias Sarantopoulos: ml-services: upgrade revscoring kserve to 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/883171 (https://phabricator.wikimedia.org/T325528) [13:44:34] !log reboot ulsfo switches for software upgrade [13:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:40] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:48:58] (03PS1) 10Jbond: postgress: update password check to use grep 1 [puppet] - 10https://gerrit.wikimedia.org/r/883173 [13:49:00] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - [13:49:00] , AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:49:03] 
(ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:19] (03CR) 10CI reject: [V: 04-1] postgress: update password check to use grep 1 [puppet] - 10https://gerrit.wikimedia.org/r/883173 (owner: 10Jbond) [13:49:23] (03PS2) 10Jbond: postgress: update password check to use grep 1 [puppet] - 10https://gerrit.wikimedia.org/r/883173 [13:49:52] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 10 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:49:56] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect [13:49:56] , AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:50:18] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:51:09] (03CR) 10Ayounsi: Add OSPF adjcaency over GRE from cr2-eqsin to cr2-eqdfw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/883170 (https://phabricator.wikimedia.org/T327265) (owner: 10Cathal Mooney) [13:51:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883167 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [13:53:04] (03CR) 10Jbond: [C: 03+2] postgress: update password check to use grep 1 [puppet] - 10https://gerrit.wikimedia.org/r/883173 (owner: 10Jbond) [13:53:32] PROBLEM - VRRP status on cr3-ulsfo is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 
misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:54:03] (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:55:02] RECOVERY - VRRP status on cr3-ulsfo is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:55:46] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:55:48] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 89, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:56] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:56:16] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 103, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:56:48] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_ulsfo_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:13] 10SRE, 10SRE-Access-Requests: Procurement Tickets Access Request for pfischer - https://phabricator.wikimedia.org/T327765 (10RobH) 05Open→03Resolved @pfischer: I've added you to #acl*procurement-review and you can now access tasks in the S4 procurement space: * Anything in #procurement or S4 is private in... 
[13:59:44] (03PS1) 10JMeybohm: Fix reference to mcrouter pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/883177 (https://phabricator.wikimedia.org/T327786) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T1400). Please do the needful. [14:00:05] duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T1400) [14:01:11] (03PS1) 10Jbond: postgres: use = to check password instead of distinct from [puppet] - 10https://gerrit.wikimedia.org/r/883179 [14:01:29] (03CR) 10Jbond: [C: 03+2] postgres: use = to check password instead of distinct from [puppet] - 10https://gerrit.wikimedia.org/r/883179 (owner: 10Jbond) [14:02:08] I can deploy in ~5 minutes if not beaten to it :-) [14:03:20] PROBLEM - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:03:41] o/ [14:04:24] TheresNoTime: would be great if you could. I was going to do it myself, but something came up and I'm slightly distracted. 
[14:04:33] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (10Jclark-ctr) @marostegui Dimm arrived I am available now to replace it [14:04:35] duesen: sure, almost ready :-) [14:05:42] (03PS6) 10Samtar: Increase PC writes from parsoid API to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868127 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [14:07:58] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868127 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [14:08:18] 10SRE-tools, 10Infrastructure-Foundations: sre.swift.roll-restart-reboot-proxies fails on thanos hosts, which lack nginx - https://phabricator.wikimedia.org/T327783 (10MatthewVernon) I don't feel strongly, but I would like in the medium term for the thanos frontends and the swift frontends to be much more simi... [14:09:22] (03Merged) 10jenkins-bot: Increase PC writes from parsoid API to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868127 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [14:09:49] (03CR) 10DCausse: [C: 03+1] dse-k8s: add rdf-streaming-update-ng namespace (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [14:09:50] !log samtar@deploy1002 Started scap: Backport for [[gerrit:868127|Increase PC writes from parsoid API to 10% (T320534)]] [14:09:54] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [14:09:56] TheresNoTime: the change only affects an internal API, I can't test is on debug hosts. 
[14:10:19] TheresNoTime: once it is deployed everywhere, i should be able to see an uptick in the relevant metrics [14:10:38] duesen: okay, will push it through [14:10:44] thanks [14:11:37] !log samtar@deploy1002 daniel and samtar: Backport for [[gerrit:868127|Increase PC writes from parsoid API to 10% (T320534)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:13:42] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) [14:13:52] RECOVERY - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:14:11] (03PS1) 10Ayounsi: Revert "Depool ulsfo for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/882743 (https://phabricator.wikimedia.org/T316532) [14:14:33] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:15:38] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:16:28] hm ^ [14:16:46] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 240 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:16:54] <_joe_> uh eqsin troubles? [14:16:59] looks like india? 
[14:17:04] <_joe_> XioNoX, topranks ^^ [14:17:12] duesen: can you confirm unrelated (can't imagine it would be?) [14:17:18] <_joe_> let me ack the incident [14:17:25] <_joe_> !incidents [14:17:25] 3272 (UNACKED) [FIRING:1] NELHigh (page thanos sre tcp.timed_out) [14:17:25] 3271 (RESOLVED) [FIRING:14] ProbeDown (probes/service ops page ulsfo prometheus sre) [14:17:25] 3267 (RESOLVED) db1170 (paged)/MariaDB Replica SQL: s7 (paged) [14:17:26] 3268 (RESOLVED) db1170 (paged)/MariaDB Replica Lag: s7 (paged) [14:17:26] 3266 (RESOLVED) db1105 (paged)/MariaDB Replica SQL: s2 (paged) [14:17:30] <_joe_> !ack 3272 [14:17:31] 3272 (ACKED) [FIRING:1] NELHigh (page thanos sre tcp.timed_out) [14:17:32] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:868127|Increase PC writes from parsoid API to 10% (T320534)]] (duration: 07m 41s) [14:17:36] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [14:17:46] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/868127 just finished deployment [14:17:56] <_joe_> XioNoX: should we try to depool eqsin? [14:18:24] _joe_: we can prepare it, might be hotlinking going away [14:18:35] <_joe_> ah right [14:18:55] also ulsfo is just coming back from network maint (not repooled yet?) [14:19:05] checking superset [14:19:08] so failing out eqsin is even more impactful than usual [14:19:10] bblack: not repooled [14:19:18] bblack: but it can be repooled [14:19:21] maintenance over [14:19:32] bblack: I was about to https://gerrit.wikimedia.org/r/c/operations/dns/+/882743 [14:19:45] XioNoX: if the risks are gone, go ahead, we might need it! 
[14:19:59] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool ulsfo for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/882743 (https://phabricator.wikimedia.org/T316532) (owner: 10Ayounsi) [14:20:06] <_joe_> yeah [14:20:15] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host druid1010.eqiad.wmnet with OS bullseye [14:20:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host druid1010.eqiad.wmnet with OS bull... [14:20:29] !log repool ulsfo (maintenance over) [14:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:33] <_joe_> XioNoX: any evidence of network saturation? [14:20:56] no smoking gun on https://superset.wikimedia.org/superset/dashboard/webrequest-live/? [14:21:05] (03CR) 10Elukey: [C: 03+2] ml-services: upgrade revscoring kserve to 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/883171 (https://phabricator.wikimedia.org/T325528) (owner: 10Ilias Sarantopoulos) [14:21:10] nor internally on https://grafana.wikimedia.org/d/m1LYjVjnz/network-icmp-probes?orgId=1&var-site=All&var-target_site=eqsin&var-role=cr&var-family=All [14:21:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:21:44] <_joe_> XioNoX: I would bet it was a routing issue well upstream of us [14:22:22] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 2 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:22:28] 
_joe_: Last flapped : 2023-01-24 14:19:14 UTC (00:03:06 ago) [14:22:38] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:22:42] our Tata transit port in eqsin [14:22:47] so looks like upstream issue [14:22:49] For clarity, this was not caused by the deployment I just finished doing, and as such I can continue with the window - correct? [14:23:02] <_joe_> TheresNoTime: correct [14:23:06] thank you :) [14:23:18] <_joe_> TheresNoTime: but you will be our scapegoat anyways [14:23:27] :D [14:23:28] <_joe_> :P [14:23:39] interface is back up [14:24:15] Just in NEL errors has subsided also. IN and IR biggest two countries. [14:24:28] because it was Tata, makes sense [14:24:42] <_joe_> IR is I think due to other factors too [14:24:44] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:25:10] yeah, the interface is not flapping (only went down once), if it happen again we can disable it and follow up upstream [14:25:14] duesen: 868127 deployed, going to close the deployment window now, nothing looks immediately "concerning" from logstash so I'm happy to leave it to you to check your metrics whenever :) [14:25:52] !log close UTC afternoon backport window [14:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:59] I'm glad that didn't happen during my maintenance :) [14:26:25] <_joe_> TheresNoTime: it's reasonable to think that your deployment caused our httpbb tests to fail though, because of a race condition :) [14:26:33] there is no planned maintenance neither from Tata [14:26:38] <_joe_> but it's not your fault, rather we are to blame [14:26:44] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:08] _joe_: I was 
just checking the httpbb, and it's working fine now [14:27:18] _joe_: as long as I am free of blame, it can be anyone's fault, I don't mind :p /joke [14:27:20] <_joe_> slyngs: yes I restarted the service [14:27:33] _joe_: Okay, then it got restarted twice :-) [14:27:38] <_joe_> slyngs: basically we check an individual appserver there, and it was restarted for the deployment [14:27:57] (03PS1) 10Jbond: puppetdb: add auth.conf file [puppet] - 10https://gerrit.wikimedia.org/r/883184 (https://phabricator.wikimedia.org/T321783) [14:28:04] <_joe_> so apache returned 503s [14:28:08] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:28:17] (03CR) 10CI reject: [V: 04-1] puppetdb: add auth.conf file [puppet] - 10https://gerrit.wikimedia.org/r/883184 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [14:28:44] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:29:05] !log switch maps (kartotherian) from eqiad to codfw (attempt #2) [14:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:15] TheresNoTime: sorry, as I said, distracted... checking now [14:29:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39219/console" [puppet] - 10https://gerrit.wikimedia.org/r/883184 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [14:29:23] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [14:29:25] (no problem!) [14:29:50] (03PS2) 10Jbond: puppetdb: add auth.conf file [puppet] - 10https://gerrit.wikimedia.org/r/883184 (https://phabricator.wikimedia.org/T321783) [14:29:58] TheresNoTime: yep, looking good! 
[14:30:08] Amir1: PC writes are up to 10% now [14:30:21] good to hear :) [14:31:11] (03CR) 10Jbond: [C: 03+2] puppetdb: add auth.conf file [puppet] - 10https://gerrit.wikimedia.org/r/883184 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [14:32:36] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:33:29] !log volans@cumin1001 START - Cookbook sre.dns.netbox [14:34:38] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [14:35:01] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:35:11] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [14:35:18] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Force update after switch upgrade - volans@cumin1001" [14:35:21] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:35:27] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:35:38] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:35:48] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:35:56] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[14:36:22] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Force update after switch upgrade - volans@cumin1001" [14:36:22] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:37:43] duesen: noted [14:38:01] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1010.eqiad.wmnet with reason: host reimage [14:39:16] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [14:41:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1010.eqiad.wmnet with reason: host reimage [14:41:12] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:41:54] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (10ayounsi) [14:42:09] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) 05Open→03Resolved a:03ayounsi All done! fpc2 didn't like the first "blank" reboot and required a power cycle using... 
[14:47:16] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39220/console" [puppet] - 10https://gerrit.wikimedia.org/r/883184 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [14:50:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:50:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:51:23] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:52:17] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: sync on main [14:52:27] (03PS3) 10Muehlenhoff: Migrate service definitions to CasRegisteredService [puppet] - 10https://gerrit.wikimedia.org/r/860551 (https://phabricator.wikimedia.org/T311235) [14:52:50] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:53:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:54:54] XioNoX: hey, were you able to address the issue ? [14:55:48] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [14:57:31] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: sync on main [14:58:07] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:58:45] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:03:03] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:03:29] (03PS1) 10Jbond: puppet: install puppet-module-puppetlabs-augeas-core on puppet >= 6 [puppet] - 10https://gerrit.wikimedia.org/r/883189 (https://phabricator.wikimedia.org/T321783) [15:03:37] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:03:39] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:04:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39221/console" [puppet] - 10https://gerrit.wikimedia.org/r/883189 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [15:05:36] (03CR) 10Muehlenhoff: puppet: install puppet-module-puppetlabs-augeas-core on puppet >= 6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883189 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [15:07:26] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet: install puppet-module-puppetlabs-augeas-core on puppet >= 6 [puppet] - 10https://gerrit.wikimedia.org/r/883189 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [15:07:53] (03PS2) 10Jbond: puppet: install puppet-module-puppetlabs-augeas-core on puppet >= 6 [puppet] - 10https://gerrit.wikimedia.org/r/883189 (https://phabricator.wikimedia.org/T321783) [15:08:06] (03CR) 10Jbond: puppet: install puppet-module-puppetlabs-augeas-core on puppet >= 6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883189 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [15:08:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883189 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [15:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata 
Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:11:29] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@15e6aa7] (codfw): Revert "codfw: Disable traffic mirroring" [15:12:02] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@15e6aa7] (codfw): Revert "codfw: Disable traffic mirroring" (duration: 00m 33s) [15:12:33] (03CR) 10Muehlenhoff: [C: 03+2] Migrate service definitions to CasRegisteredService [puppet] - 10https://gerrit.wikimedia.org/r/860551 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [15:14:25] (03PS1) 10Jbond: postgres: set the postgress method based on the pgversion [puppet] - 10https://gerrit.wikimedia.org/r/883192 [15:14:53] (03PS2) 10Jbond: postgres: set the postgress method based on the pgversion [puppet] - 10https://gerrit.wikimedia.org/r/883192 [15:15:41] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@5c58f8f] (codfw): Disable traffic mirroring from codfw to eqiad [15:16:56] (03PS3) 10Jbond: postgres: set the postgress method based on the pgversion [puppet] - 10https://gerrit.wikimedia.org/r/883192 [15:17:21] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@5c58f8f] (codfw): Disable traffic mirroring from codfw to eqiad (duration: 01m 40s) [15:21:07] (03PS4) 10Jbond: postgres: set the postgress method based on the pgversion [puppet] - 10https://gerrit.wikimedia.org/r/883192 [15:21:51] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (10Marostegui) @Jclark-ctr please proceed whenever you can. I have powered the host off. 
[15:24:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39225/console" [puppet] - 10https://gerrit.wikimedia.org/r/883192 (owner: 10Jbond) [15:26:48] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:27:11] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access to analytics-privatedata for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Ottomata) [15:27:15] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access to analytics-privatedata for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Ottomata) Approved. This will need kerberos access too. [15:27:25] (03PS5) 10Jbond: postgres: set the postgress method based on the pgversion [puppet] - 10https://gerrit.wikimedia.org/r/883192 [15:28:41] (03PS6) 10Jbond: postgres: set the postgress method based on the pgversion [puppet] - 10https://gerrit.wikimedia.org/r/883192 [15:28:53] (03PS7) 10Jbond: postgres: set the postgress method based on the pgversion [puppet] - 10https://gerrit.wikimedia.org/r/883192 [15:29:50] (03CR) 10Ottomata: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [15:30:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39227/console" [puppet] - 10https://gerrit.wikimedia.org/r/883192 (owner: 10Jbond) [15:30:35] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: sync on main [15:30:42] (03CR) 10Jbond: [C: 03+2] puppet: install puppet-module-puppetlabs-augeas-core on puppet >= 6 [puppet] - 10https://gerrit.wikimedia.org/r/883189 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [15:30:46] (03CR) 10Jbond: [V: 03+1 C: 03+2] postgres: set the postgress method based on the 
pgversion [puppet] - 10https://gerrit.wikimedia.org/r/883192 (owner: 10Jbond) [15:31:08] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:32:48] (03CR) 10Ottomata: dse-k8s: add rdf-streaming-update-ng namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [15:33:06] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm." [puppet] - 10https://gerrit.wikimedia.org/r/881386 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:34:31] (03PS2) 10Elukey: profile::pki::root_ca: add new intermediates for liftwing [puppet] - 10https://gerrit.wikimedia.org/r/883167 (https://phabricator.wikimedia.org/T327767) [15:34:44] (03PS1) 10Muehlenhoff: Failover IDP to idp2002 [dns] - 10https://gerrit.wikimedia.org/r/883196 [15:38:05] PROBLEM - Check systemd state on puppetdb1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access to analytics-privatedata for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Clement_Goubert) [15:38:54] (03CR) 10JMeybohm: [C: 03+2] Fix reference to mcrouter pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/883177 (https://phabricator.wikimedia.org/T327786) (owner: 10JMeybohm) [15:38:56] (03PS1) 10Jgiannelos: maps: Add missing index script on import [puppet] - 10https://gerrit.wikimedia.org/r/883197 [15:39:39] RECOVERY - Check systemd state on puppetdb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access to analytics-privatedata for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Clement_Goubert) Ack. CR updated for kerberos access. 
[15:41:47] (03CR) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [15:43:51] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:45:14] (03Merged) 10jenkins-bot: Fix reference to mcrouter pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/883177 (https://phabricator.wikimedia.org/T327786) (owner: 10JMeybohm) [15:45:27] (03PS1) 10Jforrester: [BETA CLUSTER] Don't try to load Kartographer on Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883201 (https://phabricator.wikimedia.org/T327724) [15:46:05] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) 05Resolved→03Stalled [15:46:10] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Clement_Goubert) [15:46:15] (03CR) 10Muehlenhoff: [C: 03+2] Failover IDP to idp2002 [dns] - 10https://gerrit.wikimedia.org/r/883196 (owner: 10Muehlenhoff) [15:48:02] (03PS1) 10Muehlenhoff: Extend access for dani [puppet] - 10https://gerrit.wikimedia.org/r/883202 [15:49:00] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for dani [puppet] - 10https://gerrit.wikimedia.org/r/883202 (owner: 10Muehlenhoff) [15:52:31] (03PS1) 10Effie Mouzeli: maps: enable tile pregeneration job on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/883204 [15:53:29] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2042.codfw.wmnet with OS bullseye [15:53:57] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Clement_Goubert) [15:54:03] !log btullis@deploy1002 helmfile [staging] DONE 
helmfile.d/services/datahub: sync on main [15:54:28] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) 05Stalled→03Resolved Retention updated for `mediawiki.httpd.accesslog` in `codfw` [15:57:12] (03PS1) 10Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - 10https://gerrit.wikimedia.org/r/883226 [15:57:29] (03CR) 10Volans: [C: 03+2] interactive: allow free responses in ask_input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881649 (https://phabricator.wikimedia.org/T327408) (owner: 10Volans) [15:57:41] (03CR) 10Volans: [C: 03+2] setup.py: add support for Python 3.10 and 3.11 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881650 (owner: 10Volans) [15:58:07] (03PS1) 10Vgutierrez: Revert "Temporary rate exemption for IABot source IPs" [puppet] - 10https://gerrit.wikimedia.org/r/883209 (https://phabricator.wikimedia.org/T318065) [15:58:36] (03PS2) 10Vgutierrez: Revert "Temporary rate exemption for IABot source IPs" [puppet] - 10https://gerrit.wikimedia.org/r/883209 (https://phabricator.wikimedia.org/T318065) [15:58:41] (03PS2) 10Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - 10https://gerrit.wikimedia.org/r/883226 [15:59:31] (03CR) 10Dzahn: admin/canary_appserver: add group of users allowed to disable puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: 10Dzahn) [15:59:37] (03PS4) 10Dzahn: admin/canary_appserver: add group of users allowed to disable puppet [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) [16:00:38] (03CR) 10Dzahn: [C: 03+2] "empty group mediawiki-testers is removed, mwdebuggers is created. 
https://puppet-compiler.wmflabs.org/output/879147/39230/mw1414.eqiad.wmn" [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: 10Dzahn) [16:01:32] (03Merged) 10jenkins-bot: interactive: allow free responses in ask_input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881649 (https://phabricator.wikimedia.org/T327408) (owner: 10Volans) [16:01:53] (03Merged) 10jenkins-bot: setup.py: add support for Python 3.10 and 3.11 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881650 (owner: 10Volans) [16:03:38] (03PS1) 10Muehlenhoff: Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883227 (https://phabricator.wikimedia.org/T327783) [16:04:11] (03CR) 10Andrew Bogott: [C: 03+2] profile::base: fix hiera key name for tls_client_auth [puppet] - 10https://gerrit.wikimedia.org/r/876251 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [16:05:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/857726 (owner: 10Volans) [16:06:04] (03CR) 10CI reject: [V: 04-1] Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883227 (https://phabricator.wikimedia.org/T327783) (owner: 10Muehlenhoff) [16:06:56] (03PS1) 10Muehlenhoff: Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) [16:07:48] jouncebot: next [16:07:48] In 0 hour(s) and 52 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T1700) [16:08:38] (03CR) 10CI reject: [V: 04-1] Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) (owner: 10Muehlenhoff) [16:08:52] (03CR) 10Cwhite: [C: 03+2] opensearch: make upgrade-phatality.sh stricter [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: 10Hashar) [16:09:36] !log jiji@cumin1001 START - Cookbook 
sre.hosts.downtime for 2:00:00 on mc2042.codfw.wmnet with reason: host reimage [16:10:19] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [16:10:32] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:12:27] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) @daniel Here you go, you (and other deployers) should now be able to disable (and enable) puppet on med... [16:12:45] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2042.codfw.wmnet with reason: host reimage [16:13:22] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) 05Open→03Resolved a:03Dzahn [16:13:33] (03CR) 10Cwhite: [C: 03+1] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 (owner: 10Clément Goubert) [16:15:34] (03CR) 10Volans: Split Swift cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/883227 (https://phabricator.wikimedia.org/T327783) (owner: 10Muehlenhoff) [16:17:45] (03CR) 10Effie Mouzeli: [C: 03+2] maps: enable tile pregeneration job on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/883204 (owner: 10Effie Mouzeli) [16:19:41] (03PS8) 10Clément Goubert: mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 [16:19:51] (03CR) 10Vgutierrez: [C: 03+2] Revert "Temporary rate exemption for IABot source IPs" [puppet] - 10https://gerrit.wikimedia.org/r/883209 (https://phabricator.wikimedia.org/T318065) (owner: 10Vgutierrez) [16:22:24] (03Merged) 10jenkins-bot: maps: enable tile pregeneration job on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/883204 (owner: 10Effie Mouzeli) [16:22:49] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [16:22:52] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [16:23:02] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [16:23:18] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [16:23:22] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [16:23:24] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:28:31] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2042.codfw.wmnet with OS bullseye [16:29:04] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [16:29:22] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:34:09] (03PS3) 10Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] 
- 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) [16:34:29] (03PS4) 10Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) [16:39:29] (03CR) 10Volans: [C: 03+1] "This can go now" [cookbooks] - 10https://gerrit.wikimedia.org/r/857726 (owner: 10Volans) [16:40:09] (03PS1) 10Jbond: Puppetfile: order puppet file and add some addtional notes [puppet] - 10https://gerrit.wikimedia.org/r/883232 [16:40:11] (03PS1) 10Jbond: augeas_core: add augeas core module to the vendor modules [puppet] - 10https://gerrit.wikimedia.org/r/883233 (https://phabricator.wikimedia.org/T321783) [16:41:25] (03PS2) 10Jbond: augeas_core: add augeas core module to the vendor modules [puppet] - 10https://gerrit.wikimedia.org/r/883233 (https://phabricator.wikimedia.org/T321783) [16:46:32] (03CR) 10Volans: [C: 04-1] "Would not produce the expected result" [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: 10Btullis) [16:47:00] (03PS2) 10Volans: sre.hosts.provision: set iDRAC host/domain names [cookbooks] - 10https://gerrit.wikimedia.org/r/857726 [16:47:28] (03PS1) 10Cwhite: Enable interface and rule fields. 
[software/ecs] - 10https://gerrit.wikimedia.org/r/882777 (https://phabricator.wikimedia.org/T325806) [16:47:35] (03PS4) 10Clément Goubert: httpd-cgi: Bump ecs version to 1.11.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881876 [16:49:48] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: set iDRAC host/domain names [cookbooks] - 10https://gerrit.wikimedia.org/r/857726 (owner: 10Volans) [16:50:28] (03PS9) 10Clément Goubert: mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 [16:51:31] (03Merged) 10jenkins-bot: sre.hosts.provision: set iDRAC host/domain names [cookbooks] - 10https://gerrit.wikimedia.org/r/857726 (owner: 10Volans) [16:53:03] (03PS1) 10Urbanecm: [Growth] Remove mentor list variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883236 (https://phabricator.wikimedia.org/T321501) [16:54:43] (03PS1) 10Andrew Bogott: mwopenstackclients3: black [puppet] - 10https://gerrit.wikimedia.org/r/883237 [16:54:45] (03PS1) 10Andrew Bogott: mwopenstackclients3: Add a bunch of retrying [puppet] - 10https://gerrit.wikimedia.org/r/883238 (https://phabricator.wikimedia.org/T327375) [16:57:02] (03CR) 10JHathaway: "Thanks for adding the additional comments to the puppetfile. Though ordering alphabetically is nice, do you think its worth the downside o" [puppet] - 10https://gerrit.wikimedia.org/r/883232 (owner: 10Jbond) [16:58:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [16:59:31] (03CR) 10JHathaway: [C: 03+1] "Other than some concerns with the "troll out plan" ;), I think this looks good." [puppet] - 10https://gerrit.wikimedia.org/r/883233 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [17:00:05] jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T1700). [17:00:05] jnuche: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:13] (03CR) 10Btullis: Add reverse DNS IPv4 entries for the staging-codfw k8s cluster (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/883226 (https://phabricator.wikimedia.org/T327799) (owner: 10Btullis) [17:00:36] jnuche: hello! looking [17:01:02] rzl: hi there! [17:01:28] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients3: black [puppet] - 10https://gerrit.wikimedia.org/r/883237 (owner: 10Andrew Bogott) [17:02:53] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients3: Add a bunch of retrying [puppet] - 10https://gerrit.wikimedia.org/r/883238 (https://phabricator.wikimedia.org/T327375) (owner: 10Andrew Bogott) [17:03:03] (03PS2) 10Andrew Bogott: mwopenstackclients3: Add a bunch of retrying [puppet] - 10https://gerrit.wikimedia.org/r/883238 (https://phabricator.wikimedia.org/T327375) [17:03:06] * jbond here as well if needed [17:03:07] (03CR) 10RLazarus: [C: 03+2] jenkins: add remaining config for Scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/860837 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [17:03:38] jnuche: merging now -- do you need puppet manually run anywhere so that you can test? [17:04:19] !log rebooting restbase cassandra nodes, row c -- T325132 [17:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:21] rzl: yeah, can you please run it on deploy1002.eqiad.wmnet? 
[17:04:49] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2015.codfw.wmnet [17:04:51] ⏳ [17:06:16] jnuche: from deploy1002: [17:06:19] Error: Execution of '/usr/bin/scap deploy --init' returned 1: [17:06:19] Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[releng/jenkins-deploy]/Scap_source[releng/jenkins-deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 1: [17:06:44] I'll be nearby while you dig, let me know if you want to roll back or fix forward [17:07:57] rzl: humm, is there any more output to that scap error? [17:07:59] (03CR) 10Jbond: "thanks for the quick review" [puppet] - 10https://gerrit.wikimedia.org/r/883232 (owner: 10Jbond) [17:08:21] jnuche: nothing that came out in the puppet output [17:08:59] here's everything I got https://www.irccloud.com/pastebin/9qXqNbZx/ [17:09:00] (03PS3) 10Jbond: augeas_core: add augeas core module to the vendor modules [puppet] - 10https://gerrit.wikimedia.org/r/883233 (https://phabricator.wikimedia.org/T321783) [17:09:03] rzl: ok, let's roll back, I'll create the patch now [17:09:12] sgtm, thanks [17:09:21] (03CR) 10Jbond: augeas_core: add augeas core module to the vendor modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883233 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [17:09:43] (03PS1) 10Hnowlan: changeprop: use wmf-certificates instead of puppet_ca_crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/883240 [17:10:46] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS bullseye [17:10:56] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5017.eqsin.wmnet with OS bullseye [17:11:27] (03CR) 10JHathaway: [C: 03+1] Puppetfile: order puppet file and add some addtional 
notes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883232 (owner: 10Jbond) [17:13:14] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2015.codfw.wmnet [17:13:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2016.codfw.wmnet [17:14:56] (03PS1) 10Jaime Nuche: Revert "jenkins: add remaining config for Scap3 deployment" [puppet] - 10https://gerrit.wikimedia.org/r/883242 [17:15:39] rzl: https://gerrit.wikimedia.org/r/c/operations/puppet/+/883242 [17:15:57] (03CR) 10RLazarus: [C: 03+2] Revert "jenkins: add remaining config for Scap3 deployment" [puppet] - 10https://gerrit.wikimedia.org/r/883242 (owner: 10Jaime Nuche) [17:17:09] (03PS3) 10Cwhite: WIP: add rt_flow grokking [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [17:17:16] (03PS1) 10Jdrewniak: Add temporary extra grid-area for content translation extension [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883212 (https://phabricator.wikimedia.org/T327715) [17:18:09] jnuche: new puppet run on deploy1002 is clean, thanks for the quick rollback 👍 [17:18:38] rzl: will need to look into the error before trying again, thanks for the support! 
[17:18:46] (03PS1) 10Jdrewniak: Fix Wikitext editor preview layout in Vector 2022 [extensions/VisualEditor] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883213 (https://phabricator.wikimedia.org/T327778) [17:19:11] (03CR) 10CI reject: [V: 04-1] WIP: add rt_flow grokking [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [17:19:17] let me know if I can help gather data -- or if you want to try again between puppet windows, feel free to ping me any time in UTC-8 work hours :) [17:19:52] !log restarting ci jenkins for updates [17:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:05] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2016.codfw.wmnet [17:21:06] will do, thanks again [17:21:41] (03CR) 10Cathal Mooney: Add OSPF adjcaency over GRE from cr2-eqsin to cr2-eqdfw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/883170 (https://phabricator.wikimedia.org/T327265) (owner: 10Cathal Mooney) [17:22:47] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:22:57] godog: one issue is that when I click on "show the silence" the links points to a localhost page [17:24:52] XioNoX: from the notification email, correct ? [17:26:33] from the karma UI [17:26:43] after creating a silence manually [17:27:08] IIRC that's an ooooold issue [17:27:38] ah ok got it [17:28:16] yeah that is supposed to link to the AM ui, which debian doesn't ship [17:28:48] granted it could be nicer than localhost [17:28:49] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2020.codfw.wmnet [17:29:13] (03CR) 10Elukey: "Thanks a ton! 
Left two comments just to be sure, the rest looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/883240 (owner: 10Hnowlan) [17:29:43] (03PS2) 10Andrea Denisse: centrallog: Apply partman standard software raid recipe [puppet] - 10https://gerrit.wikimedia.org/r/882718 (https://phabricator.wikimedia.org/T318778) [17:31:49] (03CR) 10Andrea Denisse: centrallog: Apply partman standard software raid recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882718 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [17:32:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/882718 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [17:33:36] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39231/console" [puppet] - 10https://gerrit.wikimedia.org/r/882718 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [17:33:48] (03CR) 10Hnowlan: changeprop: use wmf-certificates instead of puppet_ca_crt (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/883240 (owner: 10Hnowlan) [17:36:09] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5017.eqsin.wmnet with OS bullseye [17:36:14] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp5017.eqsin.wmnet with OS bullseye executed with errors: - cp5017 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[17:37:30] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS bullseye [17:37:38] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5017.eqsin.wmnet with OS bullseye [17:40:06] (03CR) 10Andrew Bogott: [C: 04-1] "Francesco suggests that we log on each retry so we notice if things are failing a lot. I'll amend to add that." [puppet] - 10https://gerrit.wikimedia.org/r/883238 (https://phabricator.wikimedia.org/T327375) (owner: 10Andrew Bogott) [17:40:13] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host restbase2020.codfw.wmnet [17:40:48] (03PS1) 10BBlack: Configure transit_buffer for bullseye varnish [puppet] - 10https://gerrit.wikimedia.org/r/883246 (https://phabricator.wikimedia.org/T325797) [17:41:10] (03PS1) 10Volans: setup.py: force a newer sphinx_rtd_theme [software/pywmflib] - 10https://gerrit.wikimedia.org/r/883247 [17:41:29] (03CR) 10Ayounsi: Add OSPF adjcaency over GRE from cr2-eqsin to cr2-eqdfw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/883170 (https://phabricator.wikimedia.org/T327265) (owner: 10Cathal Mooney) [17:42:20] (03CR) 10Volans: Add OSPF adjcaency over GRE from cr2-eqsin to cr2-eqdfw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/883170 (https://phabricator.wikimedia.org/T327265) (owner: 10Cathal Mooney) [17:44:04] !log cp5032: upgrading packages (varnish, trafficserver [17:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:50] (03PS2) 10Cathal Mooney: Add OSPF adjcaency over GRE from cr2-eqsin to cr2-eqdfw [homer/public] - 10https://gerrit.wikimedia.org/r/883170 (https://phabricator.wikimedia.org/T327265) [17:47:58] (03CR) 10Dzahn: [C: 03+2] phabricator weekly changes email: List Herald actions on archived tags [puppet] - 
10https://gerrit.wikimedia.org/r/881884 (https://phabricator.wikimedia.org/T327508) (owner: 10Aklapper) [17:48:03] (03PS2) 10Dzahn: phabricator weekly changes email: List Herald actions on archived tags [puppet] - 10https://gerrit.wikimedia.org/r/881884 (https://phabricator.wikimedia.org/T327508) (owner: 10Aklapper) [17:51:45] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) [17:51:51] (03CR) 10Ayounsi: [C: 03+1] Add OSPF adjcaency over GRE from cr2-eqsin to cr2-eqdfw [homer/public] - 10https://gerrit.wikimedia.org/r/883170 (https://phabricator.wikimedia.org/T327265) (owner: 10Cathal Mooney) [17:54:07] (03CR) 10Dzahn: "@Marostegui The application has been removed. The mysql GRANTS can be removed as well." [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [17:56:27] (03CR) 10Jcrespo: "I recommend a rebase (will need manual merging), as grants changed at https://gerrit.wikimedia.org/r/c/operations/puppet/+/881868" [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [17:57:20] (03CR) 10Dzahn: "I manually deleted the application/deployment dir on both backend servers." [puppet] - 10https://gerrit.wikimedia.org/r/881696 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [17:57:43] (03PS2) 10Dzahn: mariadb: remove grants and settings for racktables db [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) [17:58:04] (03CR) 10Dzahn: "Thanks! I told gerrit to rebase on 881868 and appears to have worked." 
[puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [17:58:11] (03PS1) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249 [17:58:32] (03CR) 10CI reject: [V: 04-1] D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede) [18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T1800) [18:00:45] (03PS3) 10Dzahn: mariadb: remove grants and settings for racktables db [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) [18:01:46] (03PS2) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249 [18:04:25] (03CR) 10BBlack: "PCC checks out ok on buster+bullseye here: https://puppet-compiler.wmflabs.org/output/883246/39232/" [puppet] - 10https://gerrit.wikimedia.org/r/883246 (https://phabricator.wikimedia.org/T325797) (owner: 10BBlack) [18:05:00] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2022.codfw.wmnet [18:05:08] (03CR) 10Cathal Mooney: [C: 03+2] Add OSPF adjcaency over GRE from cr2-eqsin to cr2-eqdfw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/883170 (https://phabricator.wikimedia.org/T327265) (owner: 10Cathal Mooney) [18:05:45] (03Merged) 10jenkins-bot: Add OSPF adjcaency over GRE from cr2-eqsin to cr2-eqdfw [homer/public] - 10https://gerrit.wikimedia.org/r/883170 (https://phabricator.wikimedia.org/T327265) (owner: 10Cathal Mooney) [18:06:03] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable interface and rule fields. 
[software/ecs] - 10https://gerrit.wikimedia.org/r/882777 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [18:06:21] (03CR) 10Cathal Mooney: [C: 03+2] Add OSPF adjcaency over GRE from cr2-eqsin to cr2-eqdfw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/883170 (https://phabricator.wikimedia.org/T327265) (owner: 10Cathal Mooney) [18:07:17] 10SRE, 10ops-ulsfo, 10decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (10RobH) 05Stalled→03Resolved added details to the recycled tab in the datacenter asset tags tracking google sheet and delted from netbox. [18:07:32] 10SRE, 10ops-ulsfo, 10decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (10RobH) [18:07:39] (03CR) 10Jcrespo: mariadb: remove grants and settings for racktables db (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [18:10:49] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] centrallog: Apply partman standard software raid recipe [puppet] - 10https://gerrit.wikimedia.org/r/882718 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [18:12:11] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2022.codfw.wmnet [18:14:18] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] centrallog: Apply partman standard software raid recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882718 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [18:14:44] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage [18:15:29] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39233/console" [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede) [18:16:36] (03CR) 10Cwhite: [C: 03+2] Enable interface and rule fields. 
[software/ecs] - 10https://gerrit.wikimedia.org/r/882777 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [18:16:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39234/console" [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede) [18:17:05] (03Merged) 10jenkins-bot: Enable interface and rule fields. [software/ecs] - 10https://gerrit.wikimedia.org/r/882777 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [18:17:50] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage [18:19:07] (03PS2) 10Dzahn: racktables: delete profile and entire module [puppet] - 10https://gerrit.wikimedia.org/r/881696 (https://phabricator.wikimedia.org/T327405) [18:19:43] (03PS3) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249 [18:20:15] (03PS6) 10Andrea Denisse: centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - 10https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) [18:20:36] (03CR) 10CI reject: [V: 04-1] centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - 10https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [18:22:56] (03PS7) 10Andrea Denisse: centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - 10https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) [18:23:15] (03CR) 10CI reject: [V: 04-1] centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - 10https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [18:24:23] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access to analytics-privatedata for Sam Walton - 
https://phabricator.wikimedia.org/T327780 (10DannyH) I approve, as Sam's manager. [18:24:39] (03PS8) 10Andrea Denisse: centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - 10https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) [18:24:59] (03CR) 10Bking: dse-k8s: add rdf-streaming-update-ng namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [18:25:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access to analytics-privatedata for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Clement_Goubert) [18:25:20] (03Abandoned) 10Bking: dse-k8s: add rdf-streaming-update-ng namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [18:25:26] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/881696/39238/" [puppet] - 10https://gerrit.wikimedia.org/r/881696 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [18:28:17] (03CR) 10Jcrespo: "I am removing the m1 grants *for dumps* from both eqiad and codfw now for the live servers, as I am deploying the new grants in the replic" [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [18:29:09] (03CR) 10Dzahn: "sounds good to me. 
and yea, Gerrit did not do the full thing (but also did not fail) but then I amended after that" [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [18:31:09] (03CR) 10Clément Goubert: [C: 03+2] admin: Grants for samwalton on analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/883187 (https://phabricator.wikimedia.org/T327780) (owner: 10Clément Goubert) [18:36:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access to analytics-privatedata for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Clement_Goubert) 05In progress→03Resolved @Samwalton9 your access to the relevant groups has been granted. Please wait 30m (as of this comment)... [18:36:57] (03CR) 10Ottomata: dse-k8s: add rdf-streaming-update-ng namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [18:38:34] (03PS1) 10Jdrewniak: Fix Wikitext editor preview layout in Vector 2022 [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883216 (https://phabricator.wikimedia.org/T327778) [18:39:03] (03Restored) 10Bking: dse-k8s: add rdf-streaming-update-ng namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [18:39:47] (03PS1) 10Jdrewniak: Add temporary extra grid-area for content translation extension [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883217 (https://phabricator.wikimedia.org/T327715) [18:40:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6009.drmrs.wmnet with OS bullseye [18:42:50] (03PS1) 10Cwhite: Bugfix: dynamic templates to use correct placeholder [software/ecs] - 10https://gerrit.wikimedia.org/r/882778 [18:43:30] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp5025.eqsin.wmnet with OS bullseye [18:43:36] 10SRE, 10Traffic: Upgrade Traffic hosts to 
bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5025.eqsin.wmnet with OS bullseye [18:46:21] (03CR) 10Jcrespo: mariadb: remove grants and settings for racktables db (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [18:49:38] (03CR) 10Cwhite: [C: 03+2] Bugfix: dynamic templates to use correct placeholder [software/ecs] - 10https://gerrit.wikimedia.org/r/882778 (owner: 10Cwhite) [18:50:20] (03Merged) 10jenkins-bot: Bugfix: dynamic templates to use correct placeholder [software/ecs] - 10https://gerrit.wikimedia.org/r/882778 (owner: 10Cwhite) [18:55:24] !log deploy new dump grants for analytics dbs at db1108 T327155 [18:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:28] T327155: Setup dbprov1004 an dbprov2004 as an expansion of the dbprov (database provisioning) cluster, in preparation of binlog backups backup implementation - https://phabricator.wikimedia.org/T327155 [18:58:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:58:43] (03PS1) 10Cwhite: logstash: deploy ecs 1.11.0-5 and enable in beta [puppet] - 10https://gerrit.wikimedia.org/r/882779 (https://phabricator.wikimedia.org/T325806) [18:58:45] (03PS1) 10Cwhite: logstash: enable ecs 1.11.0-5 in production [puppet] - 10https://gerrit.wikimedia.org/r/882780 (https://phabricator.wikimedia.org/T325806) [18:58:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:58:47] (03PS1) 10Cwhite: logstash: remove ecs 1.11.0-2 template [puppet] 
- 10https://gerrit.wikimedia.org/r/882781 (https://phabricator.wikimedia.org/T325806) [19:00:00] (03PS2) 10Jcrespo: dbbackups: Reorganize backups with the new dbprov[12]04 host [puppet] - 10https://gerrit.wikimedia.org/r/881360 (https://phabricator.wikimedia.org/T327155) [19:00:04] brennen and jnuche: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T1900). [19:00:18] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) We discuss this during today's meeting, we are going to put 1 spine in A1 and the other spine in A8. When we upgrade ro... [19:00:26] o/ [19:00:38] (03PS1) 10Majavah: kubernetes: Use the shared image-config configmap [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/883261 (https://phabricator.wikimedia.org/T311918) [19:01:24] (03CR) 10CI reject: [V: 04-1] kubernetes: Use the shared image-config configmap [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/883261 (https://phabricator.wikimedia.org/T311918) (owner: 10Majavah) [19:01:32] (03CR) 10Cwhite: [C: 03+2] logstash: deploy ecs 1.11.0-5 and enable in beta [puppet] - 10https://gerrit.wikimedia.org/r/882779 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [19:01:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6009.drmrs.wmnet with reason: host reimage [19:02:07] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883263 (https://phabricator.wikimedia.org/T325583) [19:02:09] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reorganize backups with the new dbprov[12]04 host [puppet] - 10https://gerrit.wikimedia.org/r/881360 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo) [19:02:11] (03CR) 
10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883263 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot) [19:02:20] (03PS2) 10Majavah: kubernetes: Use the shared image-config configmap [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/883261 (https://phabricator.wikimedia.org/T311918) [19:02:46] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883263 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot) [19:02:48] (03PS1) 10Ssingh: Release 0.15.0-3 [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/883264 (https://phabricator.wikimedia.org/T326634) [19:03:13] (03CR) 10CI reject: [V: 04-1] Add temporary extra grid-area for content translation extension [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883217 (https://phabricator.wikimedia.org/T327715) (owner: 10Jdrewniak) [19:03:29] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10BCornwall) lvs1018 was the proper place to check on ldap-ro and upload, so I've updated the two comments above to reflect that missing data. 
[19:04:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6009.drmrs.wmnet with reason: host reimage [19:05:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [19:05:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1010.eqiad.wmnet with OS bullseye [19:05:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host druid1010.eqiad.wmnet with OS bullseye... [19:06:14] (03CR) 10Dzahn: "application on servers is deleted, puppet module is deleted from repo" [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [19:06:26] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host druid1011.eqiad.wmnet with OS bullseye [19:06:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host druid1011.eqiad.wmnet with OS bull... 
[19:10:07] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.20 refs T325583 [19:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:10:12] T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583 [19:10:30] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) @BBlack Do you think you will have time for us to move lvs2007 this Thursday the 26th at 9:45am CT 2:45 pm UTC? Tha... [19:15:22] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10BBlack) @Papaul - I can't make that slot for LVS, I have meetings a bit later that might get run over. @ssingh might be able t... [19:17:42] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10ssingh) >>! In T326564#8554616, @BBlack wrote: > @Papaul - I can't make that slot for LVS, I have meetings a bit later that mig... [19:18:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access to analytics-privatedata for Sam Walton - https://phabricator.wikimedia.org/T327780 (10Samwalton9) Thanks @Clement_Goubert! 
Confirming that I received the Kerberos email in my spam folder :) [19:18:28] (03CR) 10BBlack: [C: 03+1] Release 0.15.0-3 [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/883264 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [19:19:02] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5025.eqsin.wmnet with OS bullseye [19:19:10] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp5025.eqsin.wmnet with OS bullseye executed with errors: - cp5025 (**FAIL**) - Downtimed on Icinga/Alertmanager -... [19:19:25] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp5025.eqsin.wmnet with OS bullseye [19:19:31] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5025.eqsin.wmnet with OS bullseye [19:20:15] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) @BBlack @ssingh thank you. So the process is depool the server, power it down I move it and power it back no changes i... 
[19:22:20] (03CR) 10Ssingh: [C: 03+2] Release 0.15.0-3 [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/883264 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [19:22:43] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2025.codfw.wmnet [19:24:15] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1011.eqiad.wmnet with reason: host reimage [19:24:22] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (10Jclark-ctr) 05Open→03Resolved Replaced Failed dimm thanks for your help @Marostegui [19:27:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1011.eqiad.wmnet with reason: host reimage [19:28:40] !log reprepro -C main include bullseye-wikimedia varnish-modules_0.15.0-3_amd64.changes: T326634 [19:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:44] T326634: Package and deploy varnish 6.0.11 - https://phabricator.wikimedia.org/T326634 [19:30:22] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2025.codfw.wmnet [19:33:00] !log cp5032: restart varnish-frontend [19:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:54] (03PS3) 10Andrew Bogott: mwopenstackclients3: Add a bunch of retrying [puppet] - 10https://gerrit.wikimedia.org/r/883238 (https://phabricator.wikimedia.org/T327375) [19:35:40] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp5032 is CRITICAL: connect to address 10.132.0.16 and port 3122: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:36:04] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp5032 is CRITICAL: connect to address 10.132.0.16 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:36:08] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp5032 is CRITICAL: connect to 
address 10.132.0.16 and port 3121: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:36:11] ignore the cp5032 spam, sorry [19:36:17] (it's depooled!) [19:36:18] PROBLEM - Webrequests Varnishkafka log producer on cp5032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:36:38] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp5032 is CRITICAL: connect to address 10.132.0.16 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:37:40] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp5032 is CRITICAL: connect to address 10.132.0.16 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:37:54] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp5032 is CRITICAL: connect to address 10.132.0.16 and port 3127: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:37:54] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5032 is CRITICAL: connect to address 10.132.0.16 and port 3120: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:38:43] (03PS1) 10Ssingh: Release 0.4 [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/883269 (https://phabricator.wikimedia.org/T326634) [19:39:18] !log rebooting restbase cassandra nodes, row d -- T325132 [19:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:31] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2012.codfw.wmnet [19:40:37] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [19:40:39] (03CR) 10Ryan Kemper: wdqs: add recording rule for req success ratio (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan 
Kemper) [19:40:41] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [19:41:21] (03CR) 10BBlack: [C: 03+1] Release 0.4 [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/883269 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [19:42:22] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients3: Add a bunch of retrying [puppet] - 10https://gerrit.wikimedia.org/r/883238 (https://phabricator.wikimedia.org/T327375) (owner: 10Andrew Bogott) [19:42:40] (03PS1) 10Ssingh: Release 1.5.3-4 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/883270 (https://phabricator.wikimedia.org/T326634) [19:42:50] (03CR) 10Ssingh: [C: 03+2] Release 0.4 [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/883269 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [19:43:08] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) @Jclark-ctr ping? 
[19:45:53] (03PS1) 10Ssingh: Release 1.9-3 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/883271 (https://phabricator.wikimedia.org/T326634) [19:46:06] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2012.codfw.wmnet [19:47:22] (03CR) 10BBlack: [C: 03+1] Release 1.5.3-4 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/883270 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [19:47:27] !log reprepro -C main include bullseye-wikimedia libvmod-querysort_0.4_amd64.changes: T326634 [19:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:32] T326634: Package and deploy varnish 6.0.11 - https://phabricator.wikimedia.org/T326634 [19:48:07] (03CR) 10BBlack: [C: 03+1] Release 1.9-3 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/883271 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [19:48:16] (03CR) 10Ssingh: [C: 03+2] Release 1.5.3-4 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/883270 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [19:49:23] 10SRE: Restore amire80 home directory on mwmaint1002 - https://phabricator.wikimedia.org/T292573 (10Dzahn) 05Open→03Stalled p:05Triage→03Medium pinged on slack [19:49:27] (03PS1) 10Gehel: miscweb / query_service: remove ability to list directories [puppet] - 10https://gerrit.wikimedia.org/r/883272 (https://phabricator.wikimedia.org/T324667) [19:50:38] (03PS1) 10Andrew Bogott: Include python3-tenacity in openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/883273 [19:51:07] (03CR) 10Ssingh: [C: 03+2] Release 1.9-3 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/883271 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [19:51:29] (03CR) 10Andrew Bogott: [C: 03+2] Include python3-tenacity in openstack client packages 
[puppet] - 10https://gerrit.wikimedia.org/r/883273 (owner: 10Andrew Bogott) [19:51:58] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2017.codfw.wmnet [19:52:58] 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10BCornwall) [19:53:02] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5025.eqsin.wmnet with reason: host reimage [19:53:21] !log reprepro -C main include bullseye-wikimedia libvmod-re2_1.5.3-4_amd64.changes: T326634 [19:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:24] T326634: Package and deploy varnish 6.0.11 - https://phabricator.wikimedia.org/T326634 [19:54:34] !log reprepro -C main include bullseye-wikimedia libvmod-netmapper_1.9-3_amd64.changes: T326634 [19:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:09] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5025.eqsin.wmnet with reason: host reimage [19:58:32] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2017.codfw.wmnet [19:59:12] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp5032 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.458 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:59:44] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp5032 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.476 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:00:11] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2018.codfw.wmnet [20:00:48] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp5032 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.587 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:01:02] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp5032 is OK: 
HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.477 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:01:02] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5032 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.588 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:02:18] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp5032 is OK: HTTP OK: HTTP/1.1 200 OK - 469 bytes in 0.614 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:03:22] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [20:04:50] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp5032 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.574 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:04:51] 10Puppet, 10Analytics-Radar, 10Infrastructure-Foundations: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104 (10Dzahn) It's been years since my last comment that it's been years. [20:05:13] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5017.eqsin.wmnet with OS bullseye [20:05:19] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp5017.eqsin.wmnet with OS bullseye completed: - cp5017 (**PASS**) - Removed from Puppet and PuppetDB if present -... 
[20:06:24] RECOVERY - Webrequests Varnishkafka log producer on cp5032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [20:08:14] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2041.codfw.wmnet,service=cdn [20:08:15] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2041.codfw.wmnet,service=ats-be [20:08:39] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2018.codfw.wmnet [20:08:54] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:08:58] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=cdn [20:08:58] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-be [20:09:34] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6009.drmrs.wmnet with OS bullseye [20:09:45] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6009.drmrs.wmnet with OS bullseye completed: - cp6009 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[20:12:28] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2023.codfw.wmnet [20:14:15] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2041.codfw.wmnet,service=cdn [20:14:15] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2041.codfw.wmnet,service=ats-be [20:14:21] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=cdn [20:14:21] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-be [20:16:05] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=5017.eqsin.wmnet,service=cdn [20:16:09] !log contint2001 - restarted zuul [20:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:18] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=5017.eqsin.wmnet,service=ats-be [20:16:56] !log pool cp5032 [20:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:42] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet,service=ats-be [20:18:55] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet,service=cdn [20:19:43] (03CR) 10Cwhite: [C: 03+2] logstash: enable ecs 1.11.0-5 in production [puppet] - 10https://gerrit.wikimedia.org/r/882780 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [20:19:44] PROBLEM - Confd vcl based reload on cp2041 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. https://wikitech.wikimedia.org/wiki/Varnish [20:19:44] PROBLEM - Confd vcl based reload on cp2042 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [20:19:55] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6009.drmrs.wmnet,service=cdn [20:20:04] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6009.drmrs.wmnet,service=ats-be [20:20:50] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2023.codfw.wmnet [20:21:13] (03CR) 10Cwhite: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [20:24:36] (03CR) 10Ayounsi: [C: 03+2] logstash: Add PTR resolution to firewall logs [puppet] - 10https://gerrit.wikimedia.org/r/880889 (https://phabricator.wikimedia.org/T327095) (owner: 10Ayounsi) [20:24:55] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2026.codfw.wmnet [20:24:56] (03CR) 10Cwhite: [C: 03+1] "Expanded the grok to match test case, conform to ecs, and updated test's expected format." [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [20:28:57] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet [20:28:58] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5025.eqsin.wmnet with OS bullseye [20:29:01] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet [20:29:03] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp5025.eqsin.wmnet with OS bullseye completed: - cp5025 (**PASS**) - Removed from Puppet and PuppetDB if present -... 
[20:31:48] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5025.eqsin.wmnet,service=cdn [20:31:54] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2026.codfw.wmnet [20:31:56] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5025.eqsin.wmnet,service=ats-be [20:32:09] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2027.codfw.wmnet [20:32:25] jouncebot: nowandnext [20:32:26] For the next 0 hour(s) and 27 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T1900) [20:32:26] In 0 hour(s) and 27 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T2100) [20:35:52] RECOVERY - Confd vcl based reload on cp2042 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [20:37:18] (03CR) 10BBlack: [C: 03+1] Configure transit_buffer for bullseye varnish [puppet] - 10https://gerrit.wikimedia.org/r/883246 (https://phabricator.wikimedia.org/T325797) (owner: 10BBlack) [20:37:20] (03CR) 10BBlack: [C: 03+2] Configure transit_buffer for bullseye varnish [puppet] - 10https://gerrit.wikimedia.org/r/883246 (https://phabricator.wikimedia.org/T325797) (owner: 10BBlack) [20:37:44] RECOVERY - Confd vcl based reload on cp2041 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [20:39:03] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2027.codfw.wmnet [20:40:55] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:41:18] 10SRE: Restore amire80 home directory on mwmaint1002 - https://phabricator.wikimedia.org/T292573 (10Dzahn) 05Stalled→03Resolved He said it can be closed. 
[20:49:24] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [20:49:48] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host sessionstore1001.eqiad.wmnet [20:50:20] !log rebooting sessionstore1001.eqiad.wmnet -- T325132 [20:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:23] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [20:54:25] (03PS2) 10Stang: newiki: Add new permissions to group reviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882681 (https://phabricator.wikimedia.org/T327114) [20:57:42] Successful wiki edits have just started to drop, users reported repeated "loss of session data" persisting after a refresh [20:58:34] from -tech Anybody else seeing issues with logging in right now? [20:58:49] TheresNoTime: I just restarted a session storage server [20:59:10] that should not cause a problem, but the timing is undeniable :/ [20:59:38] urandom: Yeah, I'm getting this error: "There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form. You may receive this message if you are blocking cookies." [21:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T2100). [21:00:06] nray, jan_drewniak, and cirno: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:21] the restart caused a spike of 500s [21:00:36] 500s from the storage service [21:00:54] put together T327815 very quick for timing [21:00:54] T327815: Repeated loss of session data on edit attempt - https://phabricator.wikimedia.org/T327815 [21:01:14] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host sessionstore1001.eqiad.wmnet [21:01:16] I can deploy, & holding the deploy per T327815 [21:01:32] I think we need to restart kask [21:03:10] (03PS2) 10Bking: dse-k8s: add rdf-streaming-updater namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) [21:03:29] !log holding UTC late backport window for outage, T327815 [21:03:30] I am reading wikitech to figure out how to do this, so if anyone knows without reading, help would be appreciated :) [21:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:34] (03CR) 10Bking: dse-k8s: add rdf-streaming-updater namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [21:03:44] PROBLEM - MediaWiki edit session loss on graphite1005 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 [21:04:21] here we go (https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_restart) [21:04:23] urandom: do you need help? should someone klaxon for help? 
[21:04:26] PROBLEM - MediaWiki centralauth errors on graphite1005 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [21:05:53] !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: sync [21:05:56] !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: sync [21:05:57] eevans@deploy1002: Failed to log message to wiki. Somebody should check the error logs. [21:07:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [21:07:31] jclark@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [21:07:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1011.eqiad.wmnet with OS bullseye [21:07:36] jclark@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [21:07:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host druid1011.eqiad.wmnet with OS bullseye... [21:08:00] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host druid1009.eqiad.wmnet with OS bullseye [21:08:01] jclark@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [21:08:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host druid1009.eqiad.wmnet with OS bull... 
[21:08:05] Ok, that didn't seem to work [21:09:00] brett: you around? [21:09:06] Hi. [21:09:08] urandom: if you are unsure, please klaxon [21:09:13] I can't seem to login [21:09:15] ShakespeareFan00: aware of login issues [21:09:16] ShakespeareFan00: known issue [21:09:18] ShakespeareFan00: Known issue. [21:09:33] it's https://phabricator.wikimedia.org/T327815 [21:09:51] ShakespeareFan00: (and everyone else) please report issues in #wikimedia-tech in the future please, that way we can focus on solving the issue here [21:09:51] It's complaining about not getting an 'active' token [21:09:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [21:10:04] ShakespeareFan00: -tech [21:10:07] Looks like taavi already did [21:10:08] urandom: I did a manual page to get some more eyes on it [21:10:16] taavi: I saw that, thanks [21:10:35] * brett present [21:10:43] so what exactly is broken? I can't seem to find any useful error messages from logstash [21:11:06] Loss of session data is normally a server-down issue, right? [21:11:31] yeah. urandom was apparently rebooting sessionstore1001 just a bit earlier which probably caused something to break [21:11:40] But we didn't have a network issue FWICT. [21:11:46] I restarted sessionstore1001, no impact was expected, but we're getting lots of 500s [21:11:48] Hmm. [21:12:30] sessionstore uses cassandra, right? is the cassandra cluster happy? [21:12:58] meta stuff: we need an IC, and also a wikimediastatus.net update [21:13:10] I'll do it [21:13:35] Who's looking into cassandra? [21:13:45] I have, it's find [21:13:46] https://grafana.wikimedia.org/d/-K8NgsUnz/home?orgId=1 shows a big drop in HTTP 5xx error responses starting at ~12:48 UTC, I assume that's unrelated? 
[21:13:47] I have, it's fine [21:13:55] I suspect it's Kask [21:14:08] I restarted, but that didn't seem to fix it [21:14:16] brett: do you know how to depool eqiad? [21:14:19] did you restart both DCs or just eqiad? [21:14:24] just eqiad [21:14:29] codfw seems ok [21:14:41] and the node I rebooted was eqiad [21:14:47] https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&from=now-1h&to=now has a discontinuity at 20:54 on the edit rate, which is more likely a symptom. [21:14:54] wait wait [21:14:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [21:15:00] sessionstore logs say 'Error writing to storage (Cannot achieve consistency level LOCAL_QUORUM)' [21:15:06] Loads of save failures, as people report. [21:15:12] taavi: how recent is that? [21:15:17] urandom: I could but I'm IC right now, shouldn't I be doing the meta? [21:15:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6001.drmrs.wmnet with OS bullseye [21:15:43] I actually think it's righted itself [21:15:51] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6001.drmrs.wmnet with OS bullseye [21:15:54] on a random pod I can see last one was at 21:10Z [21:15:58] Yeah, edit rate has recovered, error rate has dropped. [21:16:10] urandom: yeah people on discord are reporting it coming back [21:16:18] also seeing new entries for 'http: TLS handshake error from 10.2.2.11:48400', 10.2.2.11 is apertium.svc.eqiad.wmnet [21:16:19] taavi: good, yeah, that was the last one I saw too [21:16:21] Start 20:55, ~end 21:15 dropping from 21:11. [21:16:41] Is that from the restart? Or earlier? 
[21:17:01] I think it's the restart [21:17:39] Hi, sorry I'm late [21:17:41] So it looked OK in the logs but a restart made it actually work? [21:17:54] SAL says reboot cookbook started at 20:50 and ended at 21:01 [21:17:57] no no, the restart precipitated all of this [21:18:27] Oh, so it's just recovery from the restart that made things work? [21:18:55] taavi: that's not quite right, it came up before that but the cookbook was timing out on the SSL icinga alert (which is not critical) [21:19:21] (03PS1) 10Ottomata: WIP - install pyflink deps with pip [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883278 (https://phabricator.wikimedia.org/T327494) [21:19:57] OK, so do we want to declare the incident resolved? [21:20:11] James_F: the node restart set everything in motion, the node came back up but the service spiraled anyway. I kicked off a restart of the service, and while the recovery took longer than I expected, I think that was what fixed it [21:20:21] * James_F nods. [21:20:52] the errors are consistent with (a still) buggy connection pool, is while I have reasonable confidence that that is what fixed it [21:21:03] s/while/why/ [21:21:10] urandom: For clarity, which service? [21:21:17] kask/sessionstore [21:21:20] buggy connection pool or underprovisioning were my first ideas [21:21:59] sessionstore network traffic basically doubled for the duration of the outage, I'm not sure if that was because of rapid retries from clients or what [21:22:47] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:23:16] My theory is: Host goes down and connections to it are terminated (connection objects removed from the pool), the node comes back up but the connections are not reestablished (because reasons). 
But Cassandra sees the returned node as a viable candidate (just one that is now unreachable by the service) [21:23:28] cdanis: I was thinking retries, yeah [21:23:40] I intend to start the deployment window shortly (ping nray as first in the queue) — are we happy with that idea, or should I continue to hold? [21:23:45] Follow-ups would be a safer (drain/provision) restart procedure for kask and a less aggressive retries rate? [21:23:50] RECOVERY - MediaWiki centralauth errors on graphite1005 is OK: OK: Less than 30.00% above the threshold [0.5] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [21:25:15] TheresNoTime: That should be fine. [21:25:25] James_F: actually, any safer procedure would be a work-around (though in the near-time that is worth exploring), the connection pooling needs to be fixed ultimately. A higher node count would help a TON too. [21:25:25] Oh, also I should get that Beta config thing out. [21:25:44] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1009.eqiad.wmnet with reason: host reimage [21:25:46] nray: around for https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/882727/ ? [21:25:55] @TheresNoTime Yes I am! [21:26:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/882727 (https://phabricator.wikimedia.org/T327460) (owner: 10Nray) [21:26:10] urandom: Right. brett did you start a doc? [21:26:21] We should capture these thoughts. :-) [21:26:30] James_F: Not yet, writing a post-incident status page atm. Have copy/pastes in a vim buffer [21:26:39] Ack. 
[21:26:49] Title: "15-minute outage for editing users" [21:27:30] brett: I'd say "Editing broken for 15 minutes" but yours is good too :) [21:27:58] I've started https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_MediaWiki [21:28:35] > Wikipedia's session store suffered an outage for about 15 minutes. This caused users to be unable to log in or edit pages. Services are back online and review is in progress. [21:28:38] All good? [21:28:43] LGTM. [21:28:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1009.eqiad.wmnet with reason: host reimage [21:28:53] brett: LGTM thanks! [21:28:57] brett: LGTM [21:29:16] RECOVERY - MediaWiki edit session loss on graphite1005 is OK: OK: Less than 30.00% above the threshold [10.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 [21:29:19] should we resolve T327815? [21:29:21] T327815: Repeated loss of session data on edit attempt - https://phabricator.wikimedia.org/T327815 [21:29:36] Yes. [21:29:36] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [21:30:00] Done. [21:30:25] James_F: Filling in the incident doc now [21:30:34] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:30:36] brett: <3 [21:30:39] * urbanecm closed it as well. [21:31:32] TheresNoTime: If you could land https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/883201/ at some point that'd be smashing (no need to sync, just pull to deploy host). [21:31:43] James_F: sure thing [21:31:48] Ace. 
[21:32:27] !log samtar@deploy1002 backport aborted: (duration: 06m 28s) [21:32:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883201 (https://phabricator.wikimedia.org/T327724) (owner: 10Jforrester) [21:33:18] (03Merged) 10jenkins-bot: [BETA CLUSTER] Don't try to load Kartographer on Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883201 (https://phabricator.wikimedia.org/T327724) (owner: 10Jforrester) [21:33:33] Did anyone notice an issue before the page was sent out? [21:34:08] 21:57:42 Successful wiki edits has just started to drop, users reported repeated "loss of session data" persisting a refresh [21:34:47] Technically the "Failed to log message to wiki. Somebody should check the error logs." should have warned us. [21:35:12] Oh, but that'd be unrelated on wikitech anyway, surely? [21:35:21] brett: the page was manual [21:35:32] RhinosF1: That's right, thanks for the reminder [21:35:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6001.drmrs.wmnet with reason: host reimage [21:35:54] James_F: id hazard a guess at it must be related [21:36:39] RhinosF1: I thought wikitech didn't use kask/etc. but perhaps I'm wrong. [21:37:08] I didn't think it did either... [21:37:22] James_F: id assume it wouldn’t too but the timing is suspicious although it’s still not logging [21:37:29] Indeed. 
[21:37:29] So maybe wikitech is still broken [21:38:40] !log running migrateRevisionCommentTemp.php on testcommonswiki (s4) with --sleep 10 # T275246 [21:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:44] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [21:38:46] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6001.drmrs.wmnet with reason: host reimage [21:38:59] Log works now as it just listened to zabe [21:39:02] Oh [21:39:13] logmsgbot doesn’t say logged anymore [21:39:16] I forgot that [21:39:32] James_F: timing is far too weird. I forgot Sal didn’t always confirm. [21:39:53] Yeah. [21:41:50] (03CR) 10Samtar: [C: 03+2] "set merging for deploy" [extensions/VisualEditor] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883213 (https://phabricator.wikimedia.org/T327778) (owner: 10Jdrewniak) [21:41:54] (03CR) 10Samtar: [C: 03+2] "set merging for deploy" [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883216 (https://phabricator.wikimedia.org/T327778) (owner: 10Jdrewniak) [21:42:43] (03Merged) 10jenkins-bot: Work around sticky-positioned layers disabling subpixel rendering [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/882727 (https://phabricator.wikimedia.org/T327460) (owner: 10Nray) [21:42:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/882727 (https://phabricator.wikimedia.org/T327460) (owner: 10Nray) [21:43:08] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [21:43:14] !log samtar@deploy1002 Started scap: Backport for [[gerrit:882727|Work around sticky-positioned layers disabling subpixel rendering (T327460)]] [21:43:17] T327460: Vector 2022 
body text doesn't use subpixel rendering when TOC is pinned on Chromium - https://phabricator.wikimedia.org/T327460 [21:44:33] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [21:44:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1009.eqiad.wmnet with OS bullseye [21:44:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host druid1009.eqiad.wmnet with OS bullseye... [21:45:00] !log samtar@deploy1002 nray and samtar: Backport for [[gerrit:882727|Work around sticky-positioned layers disabling subpixel rendering (T327460)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:45:14] nray: live on mwdebug, can you test please? [21:45:32] yes, thank you [21:47:35] (03PS3) 10Samtar: newiki: Add new permissions to group reviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882681 (https://phabricator.wikimedia.org/T327114) (owner: 10Stang) [21:50:37] @TheresNoTime Things look good. You can proceed [21:50:44] ack [21:53:05] (03PS1) 10Andrew Bogott: mwopenstackclients3: increase the retry count for allinstances() [puppet] - 10https://gerrit.wikimedia.org/r/883280 [21:53:33] jan_drewniak: FYI I'm doing the wmf.19 and wmf.20 VisualEditor 'Fix Wikitext editor preview layout in Vector 2022' patches together next after they've finished merging, sound okay? :) [21:53:56] TheresNoTime: sounds good [21:54:06] oh and James_F, that beta config patch got deployed [21:54:17] TheresNoTime: Yeah, thanks! 
[21:55:46] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients3: increase the retry count for allinstances() [puppet] - 10https://gerrit.wikimedia.org/r/883280 (owner: 10Andrew Bogott) [21:56:40] (03Merged) 10jenkins-bot: Fix Wikitext editor preview layout in Vector 2022 [extensions/VisualEditor] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883213 (https://phabricator.wikimedia.org/T327778) (owner: 10Jdrewniak) [21:56:45] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:882727|Work around sticky-positioned layers disabling subpixel rendering (T327460)]] (duration: 13m 31s) [21:56:50] T327460: Vector 2022 body text doesn't use subpixel rendering when TOC is pinned on Chromium - https://phabricator.wikimedia.org/T327460 [21:56:51] nray: your patch is now live :) [21:57:04] @TheresNoTime Thanks for your help! [21:57:15] (03PS1) 10Zabe: Start reading from rev_comment_id on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883281 (https://phabricator.wikimedia.org/T299954) [21:58:23] (03Merged) 10jenkins-bot: Fix Wikitext editor preview layout in Vector 2022 [extensions/VisualEditor] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883216 (https://phabricator.wikimedia.org/T327778) (owner: 10Jdrewniak) [21:58:54] (03PS1) 10Jforrester: [BETA CLUSTER] Don't try to load Kartographer on Wikifunctions at all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883282 (https://phabricator.wikimedia.org/T327724) [21:59:01] !log samtar@deploy1002 Started scap: Backport for [[gerrit:883213|Fix Wikitext editor preview layout in Vector 2022 (T327778)]], [[gerrit:883216|Fix Wikitext editor preview layout in Vector 2022 (T327778)]] [21:59:05] T327778: Pagetools: 2017 wikitext editor preview broken - https://phabricator.wikimedia.org/T327778 [21:59:10] TheresNoTime: I have a better patch if you have the time. Sorry! 
[21:59:21] James_F: sure :) [22:00:48] !log samtar@deploy1002 samtar and jdrewniak: Backport for [[gerrit:883213|Fix Wikitext editor preview layout in Vector 2022 (T327778)]], [[gerrit:883216|Fix Wikitext editor preview layout in Vector 2022 (T327778)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [22:01:00] jan_drewniak: those two patches are now live on mwdebug, can you test? [22:01:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6001.drmrs.wmnet with OS bullseye [22:02:02] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6001.drmrs.wmnet with OS bullseye completed: - cp6001 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [22:02:34] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [22:02:50] TheresNoTime: yup good to sync [22:02:54] ack [22:04:18] (03CR) 10Samtar: "recheck" [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883217 (https://phabricator.wikimedia.org/T327715) (owner: 10Jdrewniak) [22:04:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6001.drmrs.wmnet,service=cdn [22:04:25] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6001.drmrs.wmnet,service=ats-be [22:04:51] (03PS1) 10Dzahn: add new language 'gur' (Gurenɛ) [dns] - 10https://gerrit.wikimedia.org/r/883283 (https://phabricator.wikimedia.org/T327813) [22:06:57] !log extending UTC late backport window due to late start [22:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:28] cirno: I'm going to do your config patch, 882681, next (and James_F if you have another config patch?) 
[22:08:38] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:883213|Fix Wikitext editor preview layout in Vector 2022 (T327778)]], [[gerrit:883216|Fix Wikitext editor preview layout in Vector 2022 (T327778)]] (duration: 09m 36s) [22:08:42] T327778: Pagetools: 2017 wikitext editor preview broken - https://phabricator.wikimedia.org/T327778 [22:10:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882681 (https://phabricator.wikimedia.org/T327114) (owner: 10Stang) [22:10:58] (03Merged) 10jenkins-bot: newiki: Add new permissions to group reviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882681 (https://phabricator.wikimedia.org/T327114) (owner: 10Stang) [22:11:25] !log samtar@deploy1002 Started scap: Backport for [[gerrit:882681|newiki: Add new permissions to group reviewer (T327114)]] [22:11:29] T327114: Create New Page Reviewer user right in Nepali Wikipedia - https://phabricator.wikimedia.org/T327114 [22:13:10] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:882681|newiki: Add new permissions to group reviewer (T327114)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [22:13:33] cirno: if you're around, can you test ^ ? 
(I am also going to test as its fairly simple) [22:13:38] looking [22:14:26] TheresNoTime, works as expected per https://ne.wikipedia.org/w/index.php?title=Special:Listgrouprights [22:14:32] ack [22:14:38] TheresNoTime: https://gerrit.wikimedia.org/r/883282 sorry [22:14:43] (03CR) 10Dzahn: [C: 03+2] add new language 'gur' (Gurenɛ) [dns] - 10https://gerrit.wikimedia.org/r/883283 (https://phabricator.wikimedia.org/T327813) (owner: 10Dzahn) [22:14:46] (03PS2) 10Dzahn: add new language 'gur' (Gurenɛ) [dns] - 10https://gerrit.wikimedia.org/r/883283 (https://phabricator.wikimedia.org/T327813) [22:15:47] TheresNoTime: really apologize I forgot something for patch 882681, I'll add another patch in a minute [22:16:38] no problem, I'll do them both then. jan_drewniak: I noticed one of your other patches failed CI, so have re-run that - if/when it passes I'll deploy those last two together too. Should just about have enough time [22:17:10] Am I correct in that the issue resolved on its own? How did the issue get fixed? [22:17:50] TheresNoTime: yeah... I saw, np [22:19:05] !log DNS - adding new project language "gur" (Gurenɛ) - Gurenɛ is a major language of northern Ghana and the predominant language of the Upper East Region of Ghana. It is also widely spoken in Burkina Faso.. 
T327813 [22:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:09] T327813: Create Wikipedia Farefare (Gurene) - https://phabricator.wikimedia.org/T327813 [22:19:24] (03PS1) 10Stang: newiki: Fix wgAddGroups/wgRemoveGroups setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883285 (https://phabricator.wikimedia.org/T327114) [22:19:57] TheresNoTime ^ [22:20:02] ack [22:20:27] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:882681|newiki: Add new permissions to group reviewer (T327114)]] (duration: 09m 02s) [22:20:32] T327114: Create New Page Reviewer user right in Nepali Wikipedia - https://phabricator.wikimedia.org/T327114 [22:20:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883282 (https://phabricator.wikimedia.org/T327724) (owner: 10Jforrester) [22:20:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883285 (https://phabricator.wikimedia.org/T327114) (owner: 10Stang) [22:21:23] (03CR) 10Samtar: [C: 03+2] "start merge for deploy" [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883217 (https://phabricator.wikimedia.org/T327715) (owner: 10Jdrewniak) [22:21:29] (03CR) 10Samtar: [C: 03+2] "start merge for deploy" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883212 (https://phabricator.wikimedia.org/T327715) (owner: 10Jdrewniak) [22:21:37] (03Merged) 10jenkins-bot: [BETA CLUSTER] Don't try to load Kartographer on Wikifunctions at all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883282 (https://phabricator.wikimedia.org/T327724) (owner: 10Jforrester) [22:21:41] (03Merged) 10jenkins-bot: newiki: Fix wgAddGroups/wgRemoveGroups setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883285 (https://phabricator.wikimedia.org/T327114) (owner: 10Stang) [22:21:58] 
!log samtar@deploy1002 Started scap: Backport for [[gerrit:883282|[BETA CLUSTER] Don't try to load Kartographer on Wikifunctions at all (T327724)]], [[gerrit:883285|newiki: Fix wgAddGroups/wgRemoveGroups setting (T327114)]] [22:22:04] T327724: [betalabs] Wikifunctions Conosle error: ConfigException: $wgKartographerNearby requires GeoData and CirrusSearch extensions - https://phabricator.wikimedia.org/T327724 [22:23:29] (03CR) 10Jdrewniak: "recheck" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883212 (https://phabricator.wikimedia.org/T327715) (owner: 10Jdrewniak) [22:23:45] !log samtar@deploy1002 jforrester and samtar and stang: Backport for [[gerrit:883282|[BETA CLUSTER] Don't try to load Kartographer on Wikifunctions at all (T327724)]], [[gerrit:883285|newiki: Fix wgAddGroups/wgRemoveGroups setting (T327114)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [22:23:58] cirno: can you test? [22:24:07] looking [22:24:19] TheresNoTime, yeah it works [22:24:30] ack [22:28:53] TheresNoTime: Thanks for everything! [22:29:16] np! 
[22:29:57] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:883282|[BETA CLUSTER] Don't try to load Kartographer on Wikifunctions at all (T327724)]], [[gerrit:883285|newiki: Fix wgAddGroups/wgRemoveGroups setting (T327114)]] (duration: 07m 59s) [22:30:03] T327114: Create New Page Reviewer user right in Nepali Wikipedia - https://phabricator.wikimedia.org/T327114 [22:30:03] T327724: [betalabs] Wikifunctions Conosle error: ConfigException: $wgKartographerNearby requires GeoData and CirrusSearch extensions - https://phabricator.wikimedia.org/T327724 [22:32:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883212 (https://phabricator.wikimedia.org/T327715) (owner: 10Jdrewniak) [22:33:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883217 (https://phabricator.wikimedia.org/T327715) (owner: 10Jdrewniak) [22:36:34] (03Merged) 10jenkins-bot: Add temporary extra grid-area for content translation extension [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883217 (https://phabricator.wikimedia.org/T327715) (owner: 10Jdrewniak) [22:37:29] (03Merged) 10jenkins-bot: Add temporary extra grid-area for content translation extension [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883212 (https://phabricator.wikimedia.org/T327715) (owner: 10Jdrewniak) [22:37:56] !log samtar@deploy1002 Started scap: Backport for [[gerrit:883212|Add temporary extra grid-area for content translation extension (T327715)]], [[gerrit:883217|Add temporary extra grid-area for content translation extension (T327715)]] [22:38:01] T327715: Page tools change does not work together with Content Translation Beta - https://phabricator.wikimedia.org/T327715 [22:39:41] (03CR) 10RLazarus: "Sorry to be slow responding to this, I just got back from 
vacation a few days ago. Obviously too late with this comment, but if you agree," [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [22:39:43] !log samtar@deploy1002 jdrewniak and samtar: Backport for [[gerrit:883212|Add temporary extra grid-area for content translation extension (T327715)]], [[gerrit:883217|Add temporary extra grid-area for content translation extension (T327715)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [22:39:47] jan_drewniak: those two are live on mwdebug, can you test please? [22:41:13] TheresNoTime: yup looks good [22:41:19] ack [22:47:01] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:883212|Add temporary extra grid-area for content translation extension (T327715)]], [[gerrit:883217|Add temporary extra grid-area for content translation extension (T327715)]] (duration: 09m 04s) [22:47:05] T327715: Page tools change does not work together with Content Translation Beta - https://phabricator.wikimedia.org/T327715 [22:47:13] !log closing UTC late backport window [22:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:57] (03PS2) 10Zabe: Start reading from rev_comment_id on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883281 (https://phabricator.wikimedia.org/T299954) [23:01:11] (03CR) 10Zabe: [C: 03+2] Start reading from rev_comment_id on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883281 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [23:02:01] (03Merged) 10jenkins-bot: Start reading from rev_comment_id on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883281 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [23:02:27] !log zabe@deploy1002 Started scap: Backport for [[gerrit:883281|Start reading from rev_comment_id on testcommonswiki (T299954)]] 
[23:02:31] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [23:04:12] !log zabe@deploy1002 zabe: Backport for [[gerrit:883281|Start reading from rev_comment_id on testcommonswiki (T299954)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [23:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:10:29] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:883281|Start reading from rev_comment_id on testcommonswiki (T299954)]] (duration: 08m 02s) [23:10:33] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [23:14:51] PROBLEM - Check systemd state on mw2293 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:25:16] (03PS1) 10Jdlrobson: Moves feature classes from BODY element to HTML element [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883220 (https://phabricator.wikimedia.org/T321498) [23:39:10] (03CR) 10CI reject: [V: 04-1] Moves feature classes from BODY element to HTML element [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883220 (https://phabricator.wikimedia.org/T321498) (owner: 10Jdlrobson) [23:39:41] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:43:43] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:46:55] 
(03Abandoned) 10Jdlrobson: Moves feature classes from BODY element to HTML element [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883220 (https://phabricator.wikimedia.org/T321498) (owner: 10Jdlrobson)