[00:02:05] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 201 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:03:37] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:27:35] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 19.54 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[00:29:09] RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 0.4806 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[00:29:33] (CR) Cwhite: [C: -1] Add PTR resolution to firewall logs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/880889 (https://phabricator.wikimedia.org/T327095) (owner: Ayounsi)
[00:30:35] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:32:07] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:34:31] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fdc18ce9280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[00:34:31] org/wiki/Search%23Administration
[00:34:37] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:09] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 668, active_shards: 1512, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[00:36:09] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:36:13] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:23] (CR) Cwhite: WIP: add rt_flow grokking (1 comment) [puppet] - https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: Filippo Giunchedi)
[01:31:15] (PS1) Eevans: admin: add ksarabia to ldap_only_users (wmf) [puppet] - https://gerrit.wikimedia.org/r/881491 (https://phabricator.wikimedia.org/T327337)
[01:39:32] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for KSarabia-WMF - https://phabricator.wikimedia.org/T327337 (Eevans) p:Triage→Medium a:Eevans
[01:50:45] (CR) Andrea Denisse: [C: +1] "LGTM, thanks." [puppet] - https://gerrit.wikimedia.org/r/869254 (owner: Cwhite)
[01:51:35] (CR) Andrea Denisse: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/869253 (owner: Cwhite)
[01:52:07] (CR) Andrea Denisse: [C: +1] "LGTM, thanks." [puppet] - https://gerrit.wikimedia.org/r/869252 (owner: Cwhite)
[01:53:05] (CR) Andrea Denisse: [C: +1] "LGTM, thanks." [puppet] - https://gerrit.wikimedia.org/r/869251 (https://phabricator.wikimedia.org/T301760) (owner: Cwhite)
[02:00:11] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:13] SRE, LDAP-Access-Requests: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (Eevans) a:Eevans Hi Jennifer, > The specific LDAP group that you want to be added to (optional): WMF and OPS In order to add you to group `ops`, I would need your manager's approval;...
[02:06:01] SRE, LDAP-Access-Requests: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (Eevans) p:Triage→Medium
[02:07:47] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:26] (CR) BCornwall: [C: +1] "You might want someone with more knowledge to look this over, but it seems benign to me." [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/879839 (https://phabricator.wikimedia.org/T321191) (owner: Ssingh)
[02:08:58] (PS1) Eevans: admin: add jebe to ldap_only_users (wmf) [puppet] - https://gerrit.wikimedia.org/r/881492 (https://phabricator.wikimedia.org/T327255)
[02:12:46] (JobUnavailable) firing: (11) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:39] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:15:27] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:22:26] SRE, Traffic-Icebox: TLS config issue for nginx on Buster - https://phabricator.wikimedia.org/T228730 (BCornwall) https://gerrit.wikimedia.org/r/c/operations/puppet/+/525490/ was merged long ago when traffic used nginx. Now that we don't use it, is this relevant? i.e. are there other teams still using ng...
[03:03:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[05:26:57] (PS1) Gerrit maintenance bot: mariadb: Promote db2104 to s2 master [puppet] - https://gerrit.wikimedia.org/r/881375 (https://phabricator.wikimedia.org/T327370)
[05:40:15] (PS1) Gerrit maintenance bot: mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/881376 (https://phabricator.wikimedia.org/T327372)
[05:41:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 30 hosts with reason: Primary switchover s7 T327372
[05:41:55] T327372: Switchover s7 master (db2118 -> db2121) - https://phabricator.wikimedia.org/T327372
[05:42:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 30 hosts with reason: Primary switchover s7 T327372
[05:42:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2121 with weight 0 T327372', diff saved to https://phabricator.wikimedia.org/P43188 and previous config saved to /var/cache/conftool/dbconfig/20230119-054243-ladsgroup.json
[06:01:54] (CR) Ladsgroup: [C: +2] mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/881376 (https://phabricator.wikimedia.org/T327372) (owner: Gerrit maintenance bot)
[06:02:30] !log Starting s7 codfw failover from db2118 to db2121 - T327372
[06:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2121 to s7 primary T327372', diff saved to https://phabricator.wikimedia.org/P43189 and previous config saved to /var/cache/conftool/dbconfig/20230119-060316-ladsgroup.json
[06:03:33] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[06:03:49] T327372: Switchover s7 master (db2118 -> db2121) - https://phabricator.wikimedia.org/T327372
[06:04:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2118 T327372', diff saved to https://phabricator.wikimedia.org/P43190 and previous config saved to /var/cache/conftool/dbconfig/20230119-060449-ladsgroup.json
[06:06:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[06:06:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[06:11:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[06:11:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[06:12:46] (JobUnavailable) firing: (11) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:33:36] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[06:35:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[06:35:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[06:36:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance
[06:36:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T0700)
[07:00:05] kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T0700). Please do the needful.
[07:03:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:45:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[07:45:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[07:58:04] (PS2) Muehlenhoff: memcached: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812173 (https://phabricator.wikimedia.org/T308013)
[07:59:19] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/881453 (https://phabricator.wikimedia.org/T319429) (owner: Filippo Giunchedi)
[07:59:35] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/881452 (https://phabricator.wikimedia.org/T319429) (owner: Filippo Giunchedi)
[08:00:05] Amir1, apergos, and jnuche: Time to snap out of that daydream and deploy UTC morning backport and config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T0800).
[08:00:20] apparently it is morning. there are no trainees signed up, which is just as well since there are also no patches scheduled for deployment. and this means that any self-deployers who want to sneak something in at the last minute can do so, but please add it to the calendar for the record as well. offer expires if no one speaks up in the next 15 minutes :-)
[08:00:21] any gerrit patches?
[08:00:39] nope :)
[08:00:53] (after 15 minutes I will wander off from paying attention in here.)
[08:01:27] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/881491 (https://phabricator.wikimedia.org/T327337) (owner: Eevans)
[08:03:21] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/881492 (https://phabricator.wikimedia.org/T327255) (owner: Eevans)
[08:03:25] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (MoritzMuehlenhoff) >>! In T327255#8538985, @Eevans wrote: > Hi Jennifer, > >> The specific LDAP group that you want to be added to (optional): WMF and OPS > > In or...
[08:08:53] SRE-tools, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, Patch-For-Review: Cumin/Openstack: multi-project commands are extremely slow - https://phabricator.wikimedia.org/T325773 (taavi)
[08:09:02] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:09:40] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (JEbe-WMF) >>! In T327255#8539366, @MoritzMuehlenhoff wrote: >>>! In T327255#8538985, @Eevans wrote: >> Hi Jennifer, >> >>> The specific LDAP group that you want to b...
[08:26:22] !log installing sudo security updates
[08:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:50] SRE, Traffic-Icebox: TLS config issue for nginx on Buster - https://phabricator.wikimedia.org/T228730 (MoritzMuehlenhoff) >>! In T228730#8539037, @BCornwall wrote: > https://gerrit.wikimedia.org/r/c/operations/puppet/+/525490/ was merged long ago when traffic used nginx. Now that we don't use it, is this...
[08:31:54] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[08:32:47] (JobUnavailable) firing: (12) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:33:28] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[08:33:49] (CR) Giuseppe Lavagetto: [C: +1] "If we can now use directly the slowlog template, we should also remove the old template once that's verified" [deployment-charts] - https://gerrit.wikimedia.org/r/881454 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
[08:36:10] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:36:20] ^ fixing this
[08:36:34] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:48] SRE, Infrastructure-Foundations, LDAP: Retire ldap-corp cluster - https://phabricator.wikimedia.org/T323820 (MoritzMuehlenhoff) I ran tcpdump on both hosts for about a week and aside from random scanning scatter, all remaining connections were to ldap1.corp.wikimedia.org and alert[12]001.org. I'll be...
[08:47:22] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:47:47] (JobUnavailable) firing: (12) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:00:05] jnuche and jeena: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T0900).
[09:03:28] Reminder that I'll be rebooting mwmaint1002 at 1000UTC
[09:04:24] PROBLEM - puppet last run on gitlab2002 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:07:47] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:08:29] (CR) Jelto: [C: +1] "lgtm. Let me know when this should be merged and tested" [puppet] - https://gerrit.wikimedia.org/r/881007 (https://phabricator.wikimedia.org/T327060) (owner: Ahmon Dancy)
[09:10:00] RECOVERY - puppet last run on gitlab2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:14:41] (PS1) TrainBranchBot: all wikis to 1.40.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/881501 (https://phabricator.wikimedia.org/T325582)
[09:14:43] (CR) TrainBranchBot: [C: +2] all wikis to 1.40.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/881501 (https://phabricator.wikimedia.org/T325582) (owner: TrainBranchBot)
[09:15:19] (Merged) jenkins-bot: all wikis to 1.40.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/881501 (https://phabricator.wikimedia.org/T325582) (owner: TrainBranchBot)
[09:15:38] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2118.codfw.wmnet with reason: Maintenance
[09:16:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2118.codfw.wmnet with reason: Maintenance
[09:17:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2118.codfw.wmnet with reason: Maintenance
[09:17:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2118.codfw.wmnet with reason: Maintenance
[09:22:18] (CR) Vgutierrez: [C: +1] site.pp: update LVS hosts in ulsfo [puppet] - https://gerrit.wikimedia.org/r/866441 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh)
[09:24:31] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.19 refs T325582
[09:24:35] T325582: 1.40.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T325582
[09:25:13] SRE, SRE-swift-storage, Performance-Team, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Aklapper)
[09:25:37] SRE, SRE-swift-storage, Performance-Team, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Aklapper)
[09:27:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ldap-corp[1001,2001].wikimedia.org with reason: Decommissioning
[09:27:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ldap-corp[1001,2001].wikimedia.org with reason: Decommissioning
[09:31:57] (CR) Vgutierrez: [C: +1] Release 1.15.10 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/879839 (https://phabricator.wikimedia.org/T321191) (owner: Ssingh)
[09:55:14] !log installing ping3003 T273509
[09:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:18] T273509: upgrade ping offload servers to bullseye (was: ping servers running out of disk) - https://phabricator.wikimedia.org/T273509
[10:00:34] Amir1: Am I good to proceed with mwmaint reboot?
[10:03:16] Will wait two more minutes and I'll start.
[10:05:50] !log Stopping maintenance scripts on mwmaint1002.eqiad.wmnet for reboot
[10:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:06] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance
[10:06:32] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0)
[10:07:12] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwmaint1002.eqiad.wmnet
[10:11:22] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:56] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwmaint1002.eqiad.wmnet
[10:15:33] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance
[10:15:53] (CR) Filippo Giunchedi: [C: +2] site: apply webperf::profiling_tools to arclamp2001 [puppet] - https://gerrit.wikimedia.org/r/881452 (https://phabricator.wikimedia.org/T319429) (owner: Filippo Giunchedi)
[10:15:59] (PS2) Filippo Giunchedi: site: apply webperf::profiling_tools to arclamp2001 [puppet] - https://gerrit.wikimedia.org/r/881452 (https://phabricator.wikimedia.org/T319429)
[10:16:04] (CR) Filippo Giunchedi: [V: +2] site: apply webperf::profiling_tools to arclamp2001 [puppet] - https://gerrit.wikimedia.org/r/881452 (https://phabricator.wikimedia.org/T319429) (owner: Filippo Giunchedi)
[10:17:23] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0)
[10:17:31] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new ping host - jmm@cumin2002"
[10:17:53] (PS1) Btullis: Enable dashboard native filtering in Superset [puppet] - https://gerrit.wikimedia.org/r/881510 (https://phabricator.wikimedia.org/T318299)
[10:17:56] !log Restarted maintenance scripts on mwmaint1002.eqiad.wmnet
[10:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:16] Please ping me if something got borked in the process
[10:19:45] (CR) Clément Goubert: [C: +2] mediawiki: Switch php-slowlog to ecs format [deployment-charts] - https://gerrit.wikimedia.org/r/881454 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
[10:19:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new ping host - jmm@cumin2002"
[10:22:31] (CR) Btullis: [C: +2] Enable dashboard native filtering in Superset [puppet] - https://gerrit.wikimedia.org/r/881510 (https://phabricator.wikimedia.org/T318299) (owner: Btullis)
[10:22:59] (CR) Filippo Giunchedi: [C: +2] Move arclamp to arclamp2001 [puppet] - https://gerrit.wikimedia.org/r/881453 (https://phabricator.wikimedia.org/T319429) (owner: Filippo Giunchedi)
[10:23:06] (PS2) Filippo Giunchedi: Move arclamp to arclamp2001 [puppet] - https://gerrit.wikimedia.org/r/881453 (https://phabricator.wikimedia.org/T319429)
[10:23:19] (CR) Filippo Giunchedi: [V: +2] Move arclamp to arclamp2001 [puppet] - https://gerrit.wikimedia.org/r/881453 (https://phabricator.wikimedia.org/T319429) (owner: Filippo Giunchedi)
[10:24:05] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on webperf2004.codfw.wmnet with reason: decom
[10:24:18] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on webperf2004.codfw.wmnet with reason: decom
[10:25:01] jouncebot: nowandnext
[10:25:01] For the next 0 hour(s) and 34 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T0900)
[10:25:01] In 0 hour(s) and 34 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T1100)
[10:25:01] In 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T1100)
[10:25:13] (Merged) jenkins-bot: mediawiki: Switch php-slowlog to ecs format [deployment-charts] - https://gerrit.wikimedia.org/r/881454 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
[10:26:00] jnuche: is the train deployment done?
[10:26:29] (CR) Filippo Giunchedi: [C: +1] admin: add ksarabia to ldap_only_users (wmf) [puppet] - https://gerrit.wikimedia.org/r/881491 (https://phabricator.wikimedia.org/T327337) (owner: Eevans)
[10:26:41] claime: yep, it's all done
[10:26:54] jnuche: cool, thx
[10:26:58] (CR) Filippo Giunchedi: [C: +1] logstash: clean up curator actions todo items [puppet] - https://gerrit.wikimedia.org/r/869251 (https://phabricator.wikimedia.org/T301760) (owner: Cwhite)
[10:27:04] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[10:27:07] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[10:27:08] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[10:27:10] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[10:27:12] (CR) Filippo Giunchedi: [C: +1] logstash: change ecs-default clean up policy to prefix [puppet] - https://gerrit.wikimedia.org/r/869252 (owner: Cwhite)
[10:27:12] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[10:27:15] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[10:27:16] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[10:27:18] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[10:27:20] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[10:27:24] (CR) Filippo Giunchedi: [C: +1] logstash: change ecs-test clean up policy to prefix [puppet] - https://gerrit.wikimedia.org/r/869253 (owner: Cwhite)
[10:27:25] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[10:27:28] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[10:27:29] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[10:27:31] (CR) Filippo Giunchedi: [C: +1] logstash: change w3creportingapi clean up policy to prefix [puppet] - https://gerrit.wikimedia.org/r/869254 (owner: Cwhite)
[10:27:32] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[10:27:44] Hmm. I should have suppressed sal logging lol
[10:27:46] Sorry for the spam
[10:29:41] (CR) Filippo Giunchedi: [C: +1] "Current name works for me, some shorter names (though a little less specific):" [puppet] - https://gerrit.wikimedia.org/r/881370 (https://phabricator.wikimedia.org/T327308) (owner: Cwhite)
[10:39:08] (PS1) Filippo Giunchedi: Decom webperf[12]004 [puppet] - https://gerrit.wikimedia.org/r/881569 (https://phabricator.wikimedia.org/T319429)
[10:44:15] !log rebooting maps-eqiad for updates
[10:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:21] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[10:44:21] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[10:44:47] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[10:45:11] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/881569 (https://phabricator.wikimedia.org/T319429) (owner: Filippo Giunchedi)
[10:45:13] (CR) Arturo Borrero Gonzalez: "Thanks for including the PCC run https://puppet-compiler.wmflabs.org/output/880939/39155/" [puppet] - https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: Btullis)
[10:46:18] (CR) Arturo Borrero Gonzalez: Rename ceph profiles to cloudceph (1 comment) [puppet] - https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: Btullis)
[10:58:09] (CR) Filippo Giunchedi: [C: +2] Decom webperf[12]004 [puppet] - https://gerrit.wikimedia.org/r/881569 (https://phabricator.wikimedia.org/T319429) (owner: Filippo Giunchedi)
[10:58:48] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts webperf1004.eqiad.wmnet
[11:00:05] mvolz: (Dis)respected human, time to deploy Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T1100). Please do the needful.
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T1100)
[11:02:33] !log filippo@cumin1001 START - Cookbook sre.dns.netbox
[11:03:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[11:06:06] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1054.eqiad.wmnet with OS bullseye
[11:06:26] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: webperf1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1001"
[11:07:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:08:44] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: webperf1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1001"
[11:08:44] !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:08:45] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts webperf1004.eqiad.wmnet
[11:08:53] SRE, observability, Patch-For-Review, User-fgiunchedi: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: `webperf1004.eqiad.wmnet` - webperf1004.eqiad....
[11:09:16] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts webperf2004.codfw.wmnet
[11:11:11] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (fgiunchedi)
[11:11:58] SRE, observability, Patch-For-Review, User-fgiunchedi: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (fgiunchedi) Open→Resolved a:fgiunchedi This is completed -- arclamp is hosted on arclamp1001 and webperf1004 has been decom'd
[11:13:02] !log filippo@cumin1001 START - Cookbook sre.dns.netbox
[11:13:48] SRE-OnFire, Observability-Alerting, SRE Observability (FY2022/2023-Q3), User-fgiunchedi: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (fgiunchedi)
[11:17:47] (JobUnavailable) firing: (5) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:17:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1054.eqiad.wmnet with reason: host reimage
[11:18:11] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: webperf2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1001"
[11:20:45] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1054.eqiad.wmnet with reason: host reimage
[11:20:55] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[11:22:38] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps1009.eqiad.wmnet
[11:22:47] (JobUnavailable) firing: (5) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:24:48] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: webperf2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1001"
[11:24:48] !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:24:49] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts webperf2004.codfw.wmnet
[11:24:56] SRE, serviceops, Patch-For-Review, User-fgiunchedi: service implementation tracking: arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319429 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: `webperf2004.codfw.wmnet` - webperf2004.codfw.wmn...
[11:26:15] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10fgiunchedi) [11:26:46] 10SRE, 10serviceops, 10Patch-For-Review, 10User-fgiunchedi: service implementation tracking: arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319429 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is completed -- arclamp is hosted on arclamp2001 and webperf2004 has been decom'd [11:26:47] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:05] 10SRE, 10observability, 10Patch-For-Review, 10User-fgiunchedi: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10fgiunchedi) [11:29:33] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host maps1009.eqiad.wmnet [11:29:52] !log rebooting maps-codfw for updates [11:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:07] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:36:43] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1054.eqiad.wmnet with OS bullseye [11:55:06] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 111 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:58:13] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:04:19] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units 
failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:05] !log stopping/masking slapd on ldap-corp1001/ldap-corp2001 T323820 [12:06:06] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [12:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:09] T323820: Retire ldap-corp cluster - https://phabricator.wikimedia.org/T323820 [12:19:45] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2009.codfw.wmnet [12:23:25] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:24] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host maps2009.codfw.wmnet [12:36:04] (03PS1) 10Elukey: WIP changeprop: add revscoring streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) [12:50:40] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10Platform Engineering, 10Traffic-Icebox, 10Performance-Team (Radar): Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835 (10larissagaulia) [12:51:38] (03PS1) 10Majavah: P:wmcs::kubeadm::control: monitor cert expiration [puppet] - 10https://gerrit.wikimedia.org/r/881595 (https://phabricator.wikimedia.org/T324182) [12:53:39] (03CR) 10CI reject: [V: 04-1] P:wmcs::kubeadm::control: monitor cert expiration [puppet] - 10https://gerrit.wikimedia.org/r/881595 (https://phabricator.wikimedia.org/T324182) (owner: 10Majavah) [12:54:27] (03PS2) 10Majavah: P:wmcs::kubeadm::control: monitor cert expiration [puppet] - 10https://gerrit.wikimedia.org/r/881595 (https://phabricator.wikimedia.org/T324182) [12:58:49] (03PS1) 10Btullis: Enable filterbox migration tool for superset [puppet] - 10https://gerrit.wikimedia.org/r/881596 (https://phabricator.wikimedia.org/T318299) 
[13:02:24] (03CR) 10Btullis: [C: 03+2] Enable filterbox migration tool for superset [puppet] - 10https://gerrit.wikimedia.org/r/881596 (https://phabricator.wikimedia.org/T318299) (owner: 10Btullis) [13:03:26] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:49] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: tile pregeneration and replicas bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/881597 (https://phabricator.wikimedia.org/T314472) [13:07:25] (03PS1) 10Muehlenhoff: puppetdb: No longer use the component on booworm [puppet] - 10https://gerrit.wikimedia.org/r/881598 [13:07:43] (03PS3) 10Majavah: P:wmcs::kubeadm::control: monitor cert expiration [puppet] - 10https://gerrit.wikimedia.org/r/881595 (https://phabricator.wikimedia.org/T324182) [13:09:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:wmcs::kubeadm::control: monitor cert expiration [puppet] - 10https://gerrit.wikimedia.org/r/881595 (https://phabricator.wikimedia.org/T324182) (owner: 10Majavah) [13:09:51] (03PS1) 10Effie Mouzeli: hieradata: enable maps timers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/881599 (https://phabricator.wikimedia.org/T314472) [13:11:46] (03CR) 10Jgiannelos: [C: 03+1] hieradata: enable maps timers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/881599 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [13:12:11] (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: tile pregeneration and replicas bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/881597 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [13:13:15] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: tile pregeneration and replicas bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/881597 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [13:13:25] (03CR) 
10Effie Mouzeli: [C: 03+2] hieradata: enable maps timers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/881599 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [13:18:19] (03Merged) 10jenkins-bot: tegola-vector-tiles: tile pregeneration and replicas bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/881597 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [13:27:36] (03PS1) 10Majavah: puppetmaster: add prometheus metrics for cert expiration [puppet] - 10https://gerrit.wikimedia.org/r/881602 [13:31:39] (03PS1) 10Aklapper: phabricator weekly changes email: Split cookie licked open tasks [puppet] - 10https://gerrit.wikimedia.org/r/881604 [13:33:46] (03PS2) 10Aklapper: phabricator weekly changes email: Split cookie licked open tasks [puppet] - 10https://gerrit.wikimedia.org/r/881604 [13:36:18] (03PS4) 10Jgiannelos: Enable Linter write namespace tag and template using core config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [13:39:22] (03PS2) 10Majavah: puppetmaster: add prometheus metrics for cert expiration [puppet] - 10https://gerrit.wikimedia.org/r/881602 [13:45:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39175/console" [puppet] - 10https://gerrit.wikimedia.org/r/881602 (owner: 10Majavah) [13:49:03] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39176/console" [puppet] - 10https://gerrit.wikimedia.org/r/881602 (owner: 10Majavah) [13:49:27] (03PS1) 10Ottomata: flink - fix access to k8s api [deployment-charts] - 10https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) [13:58:55] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T1400) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T1400). [14:00:05] sbailey: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:01:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10rook) [14:01:32] sbailey doesn't seem to be around, unless they're on an alt nick that I'm not aware of [14:02:49] (03PS1) 10Marostegui: control-mariadb-client-11.0-bullseye: MariaDB 11 client file [software] - 10https://gerrit.wikimedia.org/r/881629 (https://phabricator.wikimedia.org/T326116) [14:03:31] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-11.0-bullseye: MariaDB 11 client file [software] - 10https://gerrit.wikimedia.org/r/881629 (https://phabricator.wikimedia.org/T326116) (owner: 10Marostegui) [14:04:02] (03Merged) 10jenkins-bot: control-mariadb-client-11.0-bullseye: MariaDB 11 client file [software] - 10https://gerrit.wikimedia.org/r/881629 (https://phabricator.wikimedia.org/T326116) (owner: 10Marostegui) [14:07:10] (03CR) 10Hnowlan: thumbor: add and use haproxy healthz lvs check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880898 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:08:54] (03PS1) 10Jgiannelos: mobileapps: Add default config value [deployment-charts] - 10https://gerrit.wikimedia.org/r/881631 [14:09:01] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:09:10] (03PS1) 10Phedenskog: prometheus: remove
recording rule for CPU benchmark. [puppet] - 10https://gerrit.wikimedia.org/r/881632 (https://phabricator.wikimedia.org/T321398) [14:09:22] (03PS2) 10Ottomata: flink - fix access to k8s api [deployment-charts] - 10https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) [14:10:08] (03CR) 10CI reject: [V: 04-1] flink - fix access to k8s api [deployment-charts] - 10https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:14:16] (03PS1) 10Hnowlan: thumbor: add failure condition to health check [deployment-charts] - 10https://gerrit.wikimedia.org/r/881635 (https://phabricator.wikimedia.org/T233196) [14:15:56] (03PS3) 10Ottomata: flink - fix access to k8s api [deployment-charts] - 10https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) [14:16:23] (03PS4) 10Ottomata: flink - fix access to k8s api [deployment-charts] - 10https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) [14:17:25] (03CR) 10CI reject: [V: 04-1] flink - fix access to k8s api [deployment-charts] - 10https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:20:54] (03CR) 10MSantos: [C: 03+2] mobileapps: Add default config value [deployment-charts] - 10https://gerrit.wikimedia.org/r/881631 (owner: 10Jgiannelos) [14:25:14] (03CR) 10Herron: [C: 03+1] logstash: change w3creportingapi clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869254 (owner: 10Cwhite) [14:25:36] (03CR) 10Herron: [C: 03+1] logstash: change ecs-test clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869253 (owner: 10Cwhite) [14:25:59] (03CR) 10Herron: [C: 03+1] logstash: change ecs-default clean up policy to prefix [puppet] - 10https://gerrit.wikimedia.org/r/869252 (owner: 10Cwhite) [14:26:09] taavi: Let me ping her [14:26:15] (03Merged) 10jenkins-bot: mobileapps: Add default config value [deployment-charts] - 
10https://gerrit.wikimedia.org/r/881631 (owner: 10Jgiannelos) [14:26:32] (03CR) 10Herron: [C: 03+1] logstash: clean up curator actions todo items [puppet] - 10https://gerrit.wikimedia.org/r/869251 (https://phabricator.wikimedia.org/T301760) (owner: 10Cwhite) [14:27:25] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf/ops LDAP groups for Kavitha Appakayala - https://phabricator.wikimedia.org/T327403 (10akosiaris) [14:28:30] (03CR) 10Ssingh: [C: 03+2] Release 1.15.10 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/879839 (https://phabricator.wikimedia.org/T321191) (owner: 10Ssingh) [14:28:46] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf/ops LDAP groups for Kavitha Appakayala - https://phabricator.wikimedia.org/T327403 (10akosiaris) I am Kavitha's onboarding buddy, so filing this task on her behalf. Kavitha is a regular full-time employee. [14:28:48] (03CR) 10Ssingh: [C: 03+2] site.pp: update LVS hosts in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/866441 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [14:28:58] (03PS1) 10Cathal Mooney: Move ping offload from ping3002 to ping3003 in esams [homer/public] - 10https://gerrit.wikimedia.org/r/881640 [14:29:15] (03PS4) 10Ssingh: site.pp: update LVS hosts in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/866441 (https://phabricator.wikimedia.org/T317247) [14:30:38] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:31:43] (03CR) 10Phedenskog: prometheus: remove recording rule for CPU benchmark. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881632 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog) [14:32:09] !log run populateCulComment on group2 wikis # T327290 [14:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:13] T327290: Run PopulateCulComment on all wikis - https://phabricator.wikimedia.org/T327290 [14:33:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2118.codfw.wmnet with reason: Maintenance [14:33:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2118.codfw.wmnet with reason: Maintenance [14:33:44] 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs 2023-01-31 - https://phabricator.wikimedia.org/T327404 (10Papaul) [14:33:58] 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs 2023-01-31 - https://phabricator.wikimedia.org/T327404 (10Papaul) p:05Triage→03Medium [14:34:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P43191 and previous config saved to /var/cache/conftool/dbconfig/20230119-143402-ladsgroup.json [14:35:33] 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs 2023-01-31 - https://phabricator.wikimedia.org/T327404 (10Papaul) [14:36:24] (03PS1) 10Jgiannelos: mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/881643 [14:40:03] I apologize for not being on this channel, I had the wrong channel name for operations. I am online monitoring. 
Don't use IRCCloud much anymore :-( [14:40:36] taavi: ^ [14:40:45] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:42:14] (03CR) 10MSantos: [C: 03+2] mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/881643 (owner: 10Jgiannelos) [14:43:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [homer/public] - 10https://gerrit.wikimedia.org/r/881640 (owner: 10Cathal Mooney) [14:45:42] (03CR) 10Cathal Mooney: [C: 03+2] Move ping offload from ping3002 to ping3003 in esams [homer/public] - 10https://gerrit.wikimedia.org/r/881640 (owner: 10Cathal Mooney) [14:46:17] (03Merged) 10jenkins-bot: Move ping offload from ping3002 to ping3003 in esams [homer/public] - 10https://gerrit.wikimedia.org/r/881640 (owner: 10Cathal Mooney) [14:46:49] (03Merged) 10jenkins-bot: mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/881643 (owner: 10Jgiannelos) [14:47:14] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:47:15] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs 2023-01-31 - https://phabricator.wikimedia.org/T327404 (10Papaul) [14:47:23] (03PS4) 10Herron: kafka-logging: add kafka-logging200[45] to codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/877257 (https://phabricator.wikimedia.org/T326419) [14:47:52] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "super fast execution. Nice work!"
[deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [14:48:40] 10SRE, 10SRE-Access-Requests: Requesting access to Data Engineering team resources for Jennifer Ebe - https://phabricator.wikimedia.org/T327406 (10JEbe-WMF) [14:49:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P43192 and previous config saved to /var/cache/conftool/dbconfig/20230119-144907-ladsgroup.json [14:50:47] 10SRE, 10SRE-Access-Requests: Requesting access to Data Engineering team resources for Jennifer Ebe - https://phabricator.wikimedia.org/T327406 (10JEbe-WMF) I would also be requiring access to a Kerberos principal. cc @odimitrijevic @Snwachukwu [14:57:19] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:57:26] 10SRE-tools, 10Infrastructure-Foundations: wmflib: improve interactive.ask_input to support free-form responses - https://phabricator.wikimedia.org/T327408 (10Volans) p:05Triage→03Medium [15:01:17] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (10Marostegui) I have finished a data check and everything looks wise from a data integrity point of view. Tables checked: ` actor actor_id archive ar_id change_tag ct_id comment comment_id... 
[15:03:00] (03PS1) 10Zabe: Add ability to start from cuc_id to populateCucComment [extensions/CheckUser] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881609 (https://phabricator.wikimedia.org/T233004) [15:03:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:04:12] (03PS1) 10Volans: setup.py: specify python_requires [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881647 [15:04:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P43193 and previous config saved to /var/cache/conftool/dbconfig/20230119-150412-ladsgroup.json [15:04:14] (03PS1) 10Volans: interactive: log the response to ask_input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881648 [15:04:16] (03PS1) 10Volans: interactive: allow free responses in ask_input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881649 (https://phabricator.wikimedia.org/T327408) [15:04:18] (03PS1) 10Volans: setup.py: add support for Python 3.10 and 3.11 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881650 [15:04:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:56] (03CR) 10Cwhite: logstash: move blackbox-exporter logs to ecs-promblkboxexp indexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881370 (https://phabricator.wikimedia.org/T327308) (owner: 10Cwhite) [15:08:10] (03PS1) 10Herron: kafka-logging200[45]: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/881652 (https://phabricator.wikimedia.org/T326419) [15:10:16] (03CR) 10Herron: [C: 03+2] kafka-logging200[45]: disable notifications 
[puppet] - 10https://gerrit.wikimedia.org/r/881652 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [15:10:40] (03CR) 10Btullis: Rename ceph profiles to cloudceph (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [15:10:44] (03PS2) 10Cwhite: logstash: move blackbox-exporter logs to ecs-probes indexes [puppet] - 10https://gerrit.wikimedia.org/r/881370 (https://phabricator.wikimedia.org/T327308) [15:13:31] (03CR) 10Filippo Giunchedi: prometheus: remove recording rule for CPU benchmark. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881632 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog) [15:17:33] !log disable puppet on all C:memcached servers to deploy 812173 [15:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:48] (03CR) 10Filippo Giunchedi: "I like the idea in general, could you give us more context though?" [puppet] - 10https://gerrit.wikimedia.org/r/881602 (owner: 10Majavah) [15:17:50] (03CR) 10Herron: [C: 03+2] kafka-logging: add kafka-logging200[45] to codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/877257 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [15:18:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/881370 (https://phabricator.wikimedia.org/T327308) (owner: 10Cwhite) [15:19:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P43194 and previous config saved to /var/cache/conftool/dbconfig/20230119-151917-ladsgroup.json [15:21:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881647 (owner: 10Volans) [15:22:44] (03CR) 10Effie Mouzeli: [C: 03+2] memcached: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/812173 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:22:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:23:16] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881648 (owner: 10Volans) [15:23:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:24:16] (03CR) 10Volans: [C: 03+2] setup.py: specify python_requires [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881647 (owner: 10Volans) [15:24:41] (03CR) 10Volans: [C: 03+2] interactive: log the response to ask_input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881648 (owner: 10Volans) [15:27:42] (03CR) 10Majavah: [V: 03+1] puppetmaster: add prometheus metrics for cert expiration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881602 (owner: 10Majavah) [15:28:53] (03Merged) 10jenkins-bot: setup.py: specify 
python_requires [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881647 (owner: 10Volans) [15:28:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:29:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881650 (owner: 10Volans) [15:29:51] (03Merged) 10jenkins-bot: interactive: log the response to ask_input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881648 (owner: 10Volans) [15:30:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:51] !log reprepro -C main include buster-wikimedia pybal_1.15.10_amd64.changes: T321191 [15:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:55] T321191: Cleanup pybal Prometheus metrics on monitor stop() - https://phabricator.wikimedia.org/T321191 [15:38:21] PROBLEM - SSH on logstash1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:38:51] PROBLEM - OpenSearch health check for shards on 9200 on logstash1024 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f4bff4f3280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [15:38:51] org/wiki/Search%23Administration [15:38:55] PROBLEM - Check systemd state on logstash1024 is CRITICAL: CRITICAL - degraded: The following units failed: 
opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:39:17] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:40:02] ^ known? [15:40:27] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:40:31] RECOVERY - SSH on logstash1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:40:51] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f6232a9d280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [15:40:52] org/wiki/Search%23Administration [15:41:24] (03CR) 10Arturo Borrero Gonzalez: Rename ceph profiles to cloudceph (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [15:41:29] sukhe: sadly yes, we've been running into OOMs with opensearch on those two hosts, I'll kick opensearch [15:41:47] ok :) [15:41:56] was waiting to check before pushing out the pybal release [15:42:14] !log bounce opensearch on logstash102[34] - T327161 [15:42:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:19] T327161: opensearch OOM on logstash102[34] - https://phabricator.wikimedia.org/T327161 [15:42:35] should recover in < 5 minutes [15:43:39] RECOVERY - OpenSearch health check for shards on 9200 on logstash1024 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 15, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 661, active_shards: 1490, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [15:43:39] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:43:43] RECOVERY - Check systemd state on logstash1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:14] !log enable puppet on C:memcached hosts [15:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:40] (03CR) 10Ahmon Dancy: Gitlab runners: Use gckeepstorage buildkitd setting to manage storage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881007 (https://phabricator.wikimedia.org/T327060) (owner: 10Ahmon Dancy) [15:46:10] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (GitLab IV: Mise En Place 🍱), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10thcipriani) [15:47:55] (LogstashNoLogsIndexed) firing: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed [15:48:15] should recover ^ [15:48:55] 
(LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:49:53] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 661, active_shards: 1490, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [15:49:53] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:52:55] (LogstashNoLogsIndexed) resolved: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed [15:53:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:55:25] !log update pybal to 1.15.10 on lvs4010: T321191 [15:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:29] T321191: Cleanup pybal Prometheus metrics on monitor stop() - https://phabricator.wikimedia.org/T321191 [15:58:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:03:20] (03CR) 10Eevans: [C: 03+2] admin: add ksarabia to ldap_only_users (wmf) [puppet] - 
10https://gerrit.wikimedia.org/r/881491 (https://phabricator.wikimedia.org/T327337) (owner: 10Eevans) [16:05:37] 10SRE, 10SRE-Access-Requests: Requesting access to Data Engineering team resources for Jennifer Ebe - https://phabricator.wikimedia.org/T327406 (10odimitrijevic) Approved [16:06:18] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [16:07:27] (03PS2) 10Eevans: admin: add jebe to ldap_only_users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/881492 (https://phabricator.wikimedia.org/T327255) [16:08:11] (03CR) 10Eevans: [C: 03+2] admin: add jebe to ldap_only_users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/881492 (https://phabricator.wikimedia.org/T327255) (owner: 10Eevans) [16:08:52] (03PS1) 10MVernon: swift: make rclone less fussy [puppet] - 10https://gerrit.wikimedia.org/r/881662 (https://phabricator.wikimedia.org/T327253) [16:08:56] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on A:logstash-collector [16:09:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [16:11:22] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['druid1009'] [16:11:30] (03CR) 10Btullis: Rename ceph profiles to cloudceph (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [16:13:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware for hosts ['druid1009'] [16:14:15] (03PS8) 10Btullis: Rename ceph profiles to cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) [16:18:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=1) rolling restart_daemons on 
A:logstash-collector [16:22:16] (03PS2) 10Elukey: changeprop: add liftwing revscoring streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) [16:22:18] (03PS1) 10Elukey: helmfile.d: add a new test workflow for Lifting to changeprop's staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302) [16:22:36] (03CR) 10Elukey: changeprop: add liftwing revscoring streams (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [16:25:19] (03CR) 10Btullis: "There is something funny going on because these two hosts don't even exist, as far as I can see:" [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [16:25:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:27:16] !log installing cryptsetup updates for bullseye [16:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:30] jouncebot, nowandnext [16:27:30] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [16:27:30] In 0 hour(s) and 32 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T1700) [16:28:55] (03CR) 10Zabe: [C: 03+2] Add ability to start from cuc_id to populateCucComment [extensions/CheckUser] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881609 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [16:30:28] (03CR) 10Klausman: [C: 03+1] helmfile.d: add a new test workflow for Lifting to changeprop's staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [16:30:46] (03PS1) 10Jelto: sre.gitlab.upgrade: improve runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/881666 (https://phabricator.wikimedia.org/T323569) [16:30:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:32:13] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/881666 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:32:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:08] (03PS2) 10Jelto: sre.gitlab.upgrade: improve runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/881666 (https://phabricator.wikimedia.org/T323569) [16:35:52] (03CR) 10Jelto: sre.gitlab.upgrade: improve runtime_description (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/881666 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:36:10] (03PS8) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [16:36:45] 10SRE, 10Observability-Metrics, 10Traffic-Icebox: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (10BCornwall) 05Stalled→03Resolved a:03BCornwall Upon further excavation, I found I misunderstood how varnishncsa was invoked! Even though there is a... 
[16:36:53] (03CR) 10Klausman: changeprop: add liftwing revscoring streams (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [16:37:07] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/881666 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:37:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:38:27] (03CR) 10Ilias Sarantopoulos: "minor glitch, all other LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [16:38:42] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39180/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [16:39:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881609 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [16:42:26] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2038.codfw.wmnet with OS bullseye [16:42:31] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "👍" [deployment-charts] - 10https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [16:42:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:43:30] (03CR) 10Elukey: 
changeprop: add liftwing revscoring streams (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [16:44:34] (03Merged) 10jenkins-bot: Add ability to start from cuc_id to populateCucComment [extensions/CheckUser] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881609 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [16:44:50] !log zabe@deploy1002 Started scap: Backport for [[gerrit:881609|Add ability to start from cuc_id to populateCucComment (T233004)]] [16:44:54] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [16:47:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:48:22] !log roll-restart opensearch-dashboards in logstash collectors eqiad - T327161 [16:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:26] T327161: opensearch OOM on logstash102[34] - https://phabricator.wikimedia.org/T327161 [16:49:07] (03PS1) 10Btullis: Correct the ceph mgr and mon keys in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/881670 (https://phabricator.wikimedia.org/T326945) [16:49:39] (03CR) 10Klausman: "zeroth" [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [16:50:10] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: improve runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/881666 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:50:59] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39181/console" [puppet] - 10https://gerrit.wikimedia.org/r/827526 
(https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [16:52:01] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: improve runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/881666 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:54:18] (03CR) 10Btullis: [C: 03+1] Update Puppet files for Airflow Upgrade to 2.3.2 [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [16:54:28] !log zabe@deploy1002 backport aborted: (duration: 15m 22s) [16:54:35] !log zabe@deploy1002 Started scap: T233004 [16:54:37] (03CR) 10Ottomata: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [16:54:38] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [16:55:34] (03CR) 10Elukey: changeprop: add liftwing revscoring streams (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [16:56:33] (03CR) 10Ilias Sarantopoulos: changeprop: add liftwing revscoring streams (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [16:58:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "Seems sensible" [puppet] - 10https://gerrit.wikimedia.org/r/881662 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [16:58:35] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2038.codfw.wmnet with reason: host reimage [16:59:22] (03CR) 10Btullis: [V: 03+2 C: 03+2] Correct the ceph mgr and mon keys in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/881670 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [17:00:05] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. 
Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:46] (03CR) 10Klausman: changeprop: add liftwing revscoring streams (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [17:02:14] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2038.codfw.wmnet with reason: host reimage [17:03:47] (03CR) 10Elukey: changeprop: add liftwing revscoring streams (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [17:07:18] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [17:13:26] !log zabe@deploy1002 Finished scap: T233004 (duration: 18m 50s) [17:13:30] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [17:17:14] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2038.codfw.wmnet with OS bullseye [17:17:14] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for KSarabia-WMF - https://phabricator.wikimedia.org/T327337 (10Eevans) 05Open→03Resolved Hi @KSarabia-WMF I've added you to the `wmf` LDAP group (`uid=ksarabia`) and the #wmf-nda Phabricator group. I'll close this issue now, but please reopen if yo... [17:21:58] (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-01-16-121758-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/881673 [17:36:12] !log bash Krinkle> Vatican Interm Papacy Runbook, § 5.1: Notify Wikipedia about incoming traffic. 
[17:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:30] ugh [17:36:36] !bash Krinkle> Vatican Interm Papacy Runbook, § 5.1: Notify Wikipedia about incoming traffic. [17:36:37] Amir1: Stored quip at https://bash.toolforge.org/quip/yhUay4UB8Fs0LHO5nKgS [17:38:27] lol [17:38:52] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf/ops LDAP groups for Kavitha Appakayala - https://phabricator.wikimedia.org/T327403 (10Eevans) p:05Triage→03Medium a:03Eevans [17:39:04] (03PS1) 10Eevans: admin: add kappakayala to ldap_only_users (wmf/ops) [puppet] - 10https://gerrit.wikimedia.org/r/881675 (https://phabricator.wikimedia.org/T327403) [17:41:15] (03PS1) 10Jdlrobson: Enable Page tools on viwiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881677 (https://phabricator.wikimedia.org/T327348) [17:51:05] 10SRE, 10Traffic, 10Traffic-Icebox, 10WMF-General-or-Unknown, and 2 others: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10DannyS712) >>! In T238285#7981540, @Xover wrote: > @BBlack The last status update on this... [17:56:21] (03CR) 10Dzahn: [C: 03+2] "tested the 3 new queries, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/881604 (owner: 10Aklapper) [17:58:46] (03CR) 10Dzahn: [C: 03+1] "confirmed in Namely and Google" [puppet] - 10https://gerrit.wikimedia.org/r/881675 (https://phabricator.wikimedia.org/T327403) (owner: 10Eevans) [18:00:05] bd808: Dear deployers, time to do the Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T1800). 
[18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T1800) [18:00:06] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-01-16-121758-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/881673 (owner: 10BryanDavis) [18:01:45] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [18:02:03] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [18:03:32] (03CR) 10Dzahn: [C: 03+2] "deploying at will" [puppet] - 10https://gerrit.wikimedia.org/r/881007 (https://phabricator.wikimedia.org/T327060) (owner: 10Ahmon Dancy) [18:03:59] Thanks mutante [18:04:35] no problem dancy, applied on runner1002 [18:04:56] lgtm, running puppet via cumin on all runners [18:05:08] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [18:05:48] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-01-16-121758-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/881673 (owner: 10BryanDavis) [18:06:02] dancy: all done and ready for optional testing [18:06:04] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [18:06:44] well, not entirely sure if "test" is a thing ... probably just us checking on disk space :) [18:06:51] if it cleaned up..next time [18:07:17] nod. I guess I could use a link to some graphs of the storage space used on runners [18:07:33] looks [18:08:04] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [18:08:38] dancy: https://grafana.wikimedia.org/d/H6fikj0nk/gitlab-runner-detail?orgId=1&refresh=30s exists but has CPU/disk/RAM and not disk space..maybe [18:08:45] we should add that.. 
hm [18:08:48] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [18:09:30] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page had an unexpected value for header content-security-policy: default-src none: connect-src app://*.wikipedia.org https://*.wikipedia.org: media-src app://upload.wikimedia.org https://upload.wikimedia.org self: img-src app://*.wikimedia.org [18:09:30] /*.wikimedia.org app://wikimedia.org https://wikimedia.org self data:: object-src none: script-src app://meta.wikimedia.org https://meta.wikimedia.org unsafe-inline: style-src app://meta.wikimedia.org https://meta.wikimedia.org app://*.wikipedia.org https://*.wikipedia.org self unsafe-inline: frame-ancestors self https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:11:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10BTullis) >>! In T325004#8525568, @taavi wrote: > Re-opening. The developer account `Hxi-ctr` has shell name `xihua`, not `hxi-ctr` which was added... 
[18:11:56] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page had an unexpected value for header content-security-policy: default-src none: connect-src app://*.wikipedia.org https://*.wikipedia.org: media-src app://upload.wikimedia.org https://upload.wikimedia.org self: img-src app://*.wikimedia.org [18:11:56] /*.wikimedia.org app://wikimedia.org https://wikimedia.org self data:: object-src none: script-src app://meta.wikimedia.org https://meta.wikimedia.org unsafe-inline: style-src app://meta.wikimedia.org https://meta.wikimedia.org app://*.wikipedia.org https://*.wikipedia.org self unsafe-inline: frame-ancestors self https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:12:28] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:12:35] dancy: in another generic dashboard I have disk utilization and disk saturation.. missing simple disk space left though [18:12:51] we'll look into that [18:12:55] OK sounds good. [18:13:17] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:16:24] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:16:58] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:17:03] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:17:23] sukhe: one more question about labweb-ssl:7443 probe the other day... the alert I logged in T327190 doesn't seem to reference a datacenter... would that fire if a labweb service went down in either center? Is it possible to downtime one or the other? 
[18:17:24] T327190: Improve horizon downtime process - https://phabricator.wikimedia.org/T327190 [18:17:49] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:18:02] andrewbogott: that is the LVS alert, and labweb only has LVS load balancing setup in eqiad [18:18:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:18:48] taavi: oh... I keep thinking that I have both DCs configured but that would explain things a bit. [18:23:22] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for KSarabia-WMF - https://phabricator.wikimedia.org/T327337 (10KSarabia-WMF) Thank you! [18:29:16] (03CR) 10Dzahn: [C: 04-1] "the appserver cluster apaches don't have phabricator.wikimedia.org as a virtual host. we'll have to do this in the phabricator server conf" [puppet] - 10https://gerrit.wikimedia.org/r/863229 (https://phabricator.wikimedia.org/T324311) (owner: 10Aklapper) [18:31:59] (03CR) 10Dzahn: "I think nowadays I am more for this than I used to be in the past. 
also based on https://phabricator.wikimedia.org/T228759#7814326 + 0.5 " [puppet] - 10https://gerrit.wikimedia.org/r/699493 (https://phabricator.wikimedia.org/T228759) (owner: 10Aklapper) [18:33:13] (03PS1) 10Ssingh: Release 0.44.0+ds1-2 [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/881689 (https://phabricator.wikimedia.org/T325557) [18:33:37] (03PS2) 10Ssingh: Release 0.44.0+ds1-2 [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/881689 (https://phabricator.wikimedia.org/T325557) [18:36:36] !log re-start populateCucComment on wikidatawiki post-mwmaint-reboot in screen with --sleep 2, will take ~30 hours # T233004 [18:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:40] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [18:37:10] (03PS1) 10Jdlrobson: Fix grid blowout with limited width turned off [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881612 (https://phabricator.wikimedia.org/T327423) [18:37:20] (03CR) 10Jdlrobson: [C: 03+1] Fix grid blowout with limited width turned off [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881612 (https://phabricator.wikimedia.org/T327423) (owner: 10Jdlrobson) [18:37:44] (03CR) 10Dzahn: "needs manual rebase but I can do that" [puppet] - 10https://gerrit.wikimedia.org/r/841578 (owner: 10Dduvall) [18:38:11] (03CR) 10CI reject: [V: 04-1] Release 0.44.0+ds1-2 [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/881689 (https://phabricator.wikimedia.org/T325557) (owner: 10Ssingh) [18:38:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:40:13] (03PS2) 10Dzahn: P:gitlab::runner: Enforce Wmflib::POSIX::Variables type for environment [puppet] - 10https://gerrit.wikimedia.org/r/841578 
(owner: 10Dduvall) [18:41:56] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/output/841578/39184/gitlab-runner1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/841578 (owner: 10Dduvall) [18:43:52] (03CR) 10Ssingh: "E: cadvisor: statically-linked-binary [usr/bin/cadvisor]" [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/881689 (https://phabricator.wikimedia.org/T325557) (owner: 10Ssingh) [18:44:10] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/841578/39185/" [puppet] - 10https://gerrit.wikimedia.org/r/841578 (owner: 10Dduvall) [18:48:48] (03PS3) 10Ssingh: Release 0.44.0+ds1-2 [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/881689 (https://phabricator.wikimedia.org/T325557) [18:51:59] (03PS1) 10Dzahn: miscweb: remove racktables profile from miscweb role [puppet] - 10https://gerrit.wikimedia.org/r/881694 (https://phabricator.wikimedia.org/T327405) [18:54:24] (03PS1) 10Dzahn: racktables: delete profile and entire module [puppet] - 10https://gerrit.wikimedia.org/r/881696 (https://phabricator.wikimedia.org/T327405) [18:56:42] (03PS1) 10Dzahn: idp: remove racktables related settings [puppet] - 10https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) [19:00:02] (03PS1) 10Dzahn: trafficserver/cache::text: remove racktables [puppet] - 10https://gerrit.wikimedia.org/r/881699 (https://phabricator.wikimedia.org/T327405) [19:00:05] jnuche and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T1900). 
[19:01:59] (03PS1) 10Dzahn: mariadb: remove grants and settings for racktables db [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) [19:02:57] (03PS1) 10Dzahn: debmonitor: remove racktables links [puppet] - 10https://gerrit.wikimedia.org/r/881702 (https://phabricator.wikimedia.org/T327405) [19:04:05] (03CR) 10Dzahn: "Is this still a thing or outdated patch? added Jelto, fyi" [puppet] - 10https://gerrit.wikimedia.org/r/684487 (owner: 10Jbond) [19:06:42] (03PS2) 10Dzahn: debmonitor: remove racktables links [puppet] - 10https://gerrit.wikimedia.org/r/881702 (https://phabricator.wikimedia.org/T327405) [19:07:52] (03PS2) 10Dzahn: idp: remove racktables related settings [puppet] - 10https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) [19:11:54] (03PS1) 10Majavah: scap: drop config for .wmflabs domains [puppet] - 10https://gerrit.wikimedia.org/r/881703 [19:12:46] (03CR) 10Majavah: "to verify, run this on deployment-cumin.deployment-prep.eqiad1.wikimedia.cloud:" [puppet] - 10https://gerrit.wikimedia.org/r/881703 (owner: 10Majavah) [19:15:11] (03CR) 10Andrew Bogott: [C: 03+2] P:prometheus::beta: swap prometheus-labs-targets with a puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/851101 (owner: 10Majavah) [19:15:36] (03CR) 10Andrew Bogott: [C: 03+2] prometheus::wmcs_scripts: deleted unused class [puppet] - 10https://gerrit.wikimedia.org/r/851102 (owner: 10Majavah) [19:15:44] (03PS2) 10Andrew Bogott: prometheus::wmcs_scripts: deleted unused class [puppet] - 10https://gerrit.wikimedia.org/r/851102 (owner: 10Majavah) [19:18:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:20:20] (03PS3) 10Ssingh: hiera: enable haproxy systemd 
hardening on cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) [19:22:37] (03PS1) 10Zabe: Start reading from cuc_comment_id everywhere except wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881706 (https://phabricator.wikimedia.org/T233004) [19:22:53] mutante: Can you run this on deploy1002 in /srv/mwbuilder/release: [19:22:53] git remote set-url origin https://gitlab.wikimedia.org/repos/releng/release.git [19:22:53] git fetch origin [19:22:53] git reset --hard origin/master [19:23:07] cc jeena [19:23:25] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39186/console" [puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [19:23:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:24:19] (03CR) 10Ssingh: [V: 03+1] "Let's aim to merge this next week (23 Jan)." 
[puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [19:47:28] (03CR) 10Zabe: [C: 03+2] Start reading from cuc_comment_id everywhere except wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881706 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [19:48:15] (03Merged) 10jenkins-bot: Start reading from cuc_comment_id everywhere except wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881706 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [19:48:34] (03PS1) 10Bking: apt-repo: add elastic7 components, drop elastic6 components [puppet] - 10https://gerrit.wikimedia.org/r/881710 (https://phabricator.wikimedia.org/T318820) [19:48:37] !log zabe@deploy1002 Started scap: Backport for [[gerrit:881706|Start reading from cuc_comment_id everywhere except wikidatawiki (T233004)]] [19:48:41] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [19:48:54] (03CR) 10CI reject: [V: 04-1] apt-repo: add elastic7 components, drop elastic6 components [puppet] - 10https://gerrit.wikimedia.org/r/881710 (https://phabricator.wikimedia.org/T318820) (owner: 10Bking) [19:49:02] zabe: Lemme know if you have any troubles w/ scap [19:49:37] !log zabe@deploy1002 zabe: Backport for [[gerrit:881706|Start reading from cuc_comment_id everywhere except wikidatawiki (T233004)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [19:50:12] dancy, it threw https://phabricator.wikimedia.org/P43196, is that a problem? [19:50:26] (03PS2) 10Bking: apt-repo: add elastic7 components, drop elastic6 components [puppet] - 10https://gerrit.wikimedia.org/r/881710 (https://phabricator.wikimedia.org/T318820) [19:50:42] Yep, that's what I was expecting. [19:50:56] We'll need assistance from an SRE to wrap that up. 
[19:52:09] Lemme see if I can hack something [19:52:26] (03CR) 10Gehel: [C: 03+1] "LGTM in principles. I don't know enough about reprepro to be sure, but it seems reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/881710 (https://phabricator.wikimedia.org/T318820) (owner: 10Bking) [19:53:00] dancy, ok, but is that a reason to abort the sync? [19:53:14] oh did it keep going? [19:53:30] That will mean that k8s deployments will not be updated. [19:54:20] dancy, yeah https://phabricator.wikimedia.org/P43197 [19:55:10] why does a scap sync depend on mediawiki/tools/release? [19:55:43] The container image building stuff is in that repo [19:55:51] It was built independently of scap [19:56:10] ah [19:56:36] ok.. hacks applied. [19:56:58] After the currently sync is done, if you don't have any more to do, please run `scap sync-world` to update the k8s deployments. [19:57:07] mutante: I did the did using docker. [19:57:34] will do [19:59:28] (03CR) 10Muehlenhoff: "Looks good in general, two comments inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/881710 (https://phabricator.wikimedia.org/T318820) (owner: 10Bking) [20:02:39] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:881706|Start reading from cuc_comment_id everywhere except wikidatawiki (T233004)]] (duration: 14m 01s) [20:02:43] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [20:05:35] !log zabe@deploy1002 Started scap: fix k8s drift [20:13:38] !log zabe@deploy1002 Finished scap: fix k8s drift (duration: 08m 02s) [20:17:18] (03PS1) 10Majavah: openstack: encapi: create parent directories for files [puppet] - 10https://gerrit.wikimedia.org/r/881711 [20:19:03] (03PS2) 10Majavah: openstack: encapi: create parent directories for files [puppet] - 10https://gerrit.wikimedia.org/r/881711 [20:25:31] (03CR) 10Gehel: [C: 03+1] apt-repo: add elastic7 components, drop elastic6 components (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881710 (https://phabricator.wikimedia.org/T318820) (owner: 10Bking) [20:34:09] (03CR) 10Bking: apt-repo: add elastic7 components, drop elastic6 components (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881710 (https://phabricator.wikimedia.org/T318820) (owner: 10Bking) [20:38:53] (03CR) 10Bking: apt-repo: add elastic7 components, drop elastic6 components (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881710 (https://phabricator.wikimedia.org/T318820) (owner: 10Bking) [20:38:59] (03CR) 10Bking: [C: 03+2] apt-repo: add elastic7 components, drop elastic6 components [puppet] - 10https://gerrit.wikimedia.org/r/881710 (https://phabricator.wikimedia.org/T318820) (owner: 10Bking) [20:47:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:00:05] brennen and 
TheresNoTime: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T2100). Please do the needful. [21:00:05] jan_drewniak: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:23] o/ [21:01:33] I'm the only one on the agenda, and I don't mind backporting my own patches [21:02:19] jan_drewniak: go for it. :) [21:02:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:03:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:03:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881612 (https://phabricator.wikimedia.org/T327423) (owner: 10Jdlrobson) [21:06:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:06:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:11:19] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2039.codfw.wmnet with OS bullseye [21:13:02] (03CR) 10Dzahn: [C: 03+2] scap: drop config for .wmflabs domains [puppet] - 10https://gerrit.wikimedia.org/r/881703 (owner: 10Majavah) [21:14:14] (03CR) 10Ahmon Dancy: "Happy to see this. Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/881703 (owner: 10Majavah) [21:17:15] jouncebot: now [21:17:16] For the next 0 hour(s) and 42 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230119T2100) [21:18:35] (03Merged) 10jenkins-bot: Fix grid blowout with limited width turned off [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/881612 (https://phabricator.wikimedia.org/T327423) (owner: 10Jdlrobson) [21:18:51] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:881612|Fix grid blowout with limited width turned off (T327423)]] [21:18:55] T327423: Horizontal scrolling when content contains extra wide elements when limited width is disabled and page tools is enabled - https://phabricator.wikimedia.org/T327423 [21:20:11] 10SRE, 10Traffic-Icebox: ATS memory leak upon removing healthchecks.so from configuration - https://phabricator.wikimedia.org/T255120 (10BCornwall) 05Open→03Declined Is this a ticket that we'd want any action on? Removal of healthcheck.so is unlikely to ever happen again (we're not dynamically loading/unlo... 
[21:20:39] !log jdrewniak@deploy1002 jdlrobson and jdrewniak: Backport for [[gerrit:881612|Fix grid blowout with limited width turned off (T327423)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:20:44] !log cwhite@deploy1002 Started deploy [releng/phatality@e0bb573]: (no justification provided) [21:20:58] !log cwhite@deploy1002 Finished deploy [releng/phatality@e0bb573]: (no justification provided) (duration: 00m 13s) [21:27:00] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2039.codfw.wmnet with reason: host reimage [21:27:04] (03CR) 10Cwhite: [C: 03+2] logstash: add logstash103[67] to cluster config [puppet] - 10https://gerrit.wikimedia.org/r/881373 (https://phabricator.wikimedia.org/T327338) (owner: 10Cwhite) [21:27:14] (03PS1) 10Herron: Revert "kafka-logging200[45]: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/881615 [21:27:17] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:881612|Fix grid blowout with limited width turned off (T327423)]] (duration: 08m 26s) [21:27:21] T327423: Horizontal scrolling when content contains extra wide elements when limited width is disabled and page tools is enabled - https://phabricator.wikimedia.org/T327423 [21:29:33] (03PS3) 10Cwhite: site: assign logging::opensearch::data role to logstash103[67] [puppet] - 10https://gerrit.wikimedia.org/r/881372 (https://phabricator.wikimedia.org/T327338) [21:30:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881677 (https://phabricator.wikimedia.org/T327348) (owner: 10Jdlrobson) [21:30:17] (03PS2) 10Jdrewniak: Enable Page tools on viwiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881677 (https://phabricator.wikimedia.org/T327348) (owner: 10Jdlrobson) [21:30:35] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy1002 using 
scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881677 (https://phabricator.wikimedia.org/T327348) (owner: 10Jdlrobson) [21:31:18] (03Merged) 10jenkins-bot: Enable Page tools on viwiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881677 (https://phabricator.wikimedia.org/T327348) (owner: 10Jdlrobson) [21:31:31] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:881677|Enable Page tools on viwiki and itwiki (T327348)]] [21:31:35] T327348: Deploy page tools to itwiki and viwiki - https://phabricator.wikimedia.org/T327348 [21:31:36] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2039.codfw.wmnet with reason: host reimage [21:32:03] (03CR) 10Herron: [C: 03+2] Revert "kafka-logging200[45]: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/881615 (owner: 10Herron) [21:33:14] !log jdrewniak@deploy1002 jdlrobson and jdrewniak: Backport for [[gerrit:881677|Enable Page tools on viwiki and itwiki (T327348)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:35:13] (03PS1) 10BCornwall: tlsproxy: Remove ssl_dyn_rec support [puppet] - 10https://gerrit.wikimedia.org/r/881717 (https://phabricator.wikimedia.org/T228730) [21:37:39] 10SRE, 10SRE-Access-Requests: Requesting access to WMF Production for Kavitha Appakayala - https://phabricator.wikimedia.org/T327450 (10Kappakayala) [21:42:10] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:881677|Enable Page tools on viwiki and itwiki (T327348)]] (duration: 10m 38s) [21:42:14] T327348: Deploy page tools to itwiki and viwiki - https://phabricator.wikimedia.org/T327348 [21:46:53] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2039.codfw.wmnet with OS bullseye [21:48:45] (03CR) 10Ahmon Dancy: [C: 03+1] ci: add contint2002 to firewall, jenkins and zuul-merger [puppet] - 
10https://gerrit.wikimedia.org/r/867703 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [21:49:23] (03CR) 10Eevans: [C: 03+2] admin: add kappakayala to ldap_only_users (wmf/ops) [puppet] - 10https://gerrit.wikimedia.org/r/881675 (https://phabricator.wikimedia.org/T327403) (owner: 10Eevans) [21:49:48] (03CR) 10Ahmon Dancy: [C: 03+1] ci: add contint2002 to zuul_merger firewall, ferm_srange [puppet] - 10https://gerrit.wikimedia.org/r/867710 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [21:51:12] (03CR) 10Ahmon Dancy: ci: add contint2002 as an migration rsync source host (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867711 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [22:00:46] (03PS1) 10Zabe: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881718 [22:02:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881718 (owner: 10Zabe) [22:03:30] (03Merged) 10jenkins-bot: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881718 (owner: 10Zabe) [22:05:04] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf/ops LDAP groups for Kavitha Appakayala - https://phabricator.wikimedia.org/T327403 (10Eevans) 05Open→03Resolved All done! @Kappakayala has been added to the `wmf` & `ops` LDAP groups, and the #wmf-nda Phabricator group. I'll close th... 
[22:12:10] (03PS3) 10Bking: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) [22:12:37] (03CR) 10Bking: flink-operator: bump version to 1.3.1 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [22:13:59] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (10Eevans) 05Open→03Resolved >>! In T327255#8539372, @JEbe-WMF wrote: > > [ ... ] > > I am not exactly certain. Because I am new, I am not sure what I need and don't but it is stated on... [22:19:12] (03CR) 10Cwhite: [C: 03+2] site: assign logging::opensearch::data role to logstash103[67] [puppet] - 10https://gerrit.wikimedia.org/r/881372 (https://phabricator.wikimedia.org/T327338) (owner: 10Cwhite) [22:20:13] (03CR) 10Ottomata: "Awesome! 1 lil nit." [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [22:21:11] (03CR) 10Ottomata: "FYI, We'll need the image upgraded too." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [22:28:49] eevans: thank you [22:31:23] (03CR) 10BBlack: [C: 03+1] tlsproxy: Remove ssl_dyn_rec support [puppet] - 10https://gerrit.wikimedia.org/r/881717 (https://phabricator.wikimedia.org/T228730) (owner: 10BCornwall) [22:42:06] (03PS3) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865214 (https://phabricator.wikimedia.org/T314318) [22:48:24] (03CR) 10Subramanya Sastry: [C: 03+1] Disable wgParserEnableLegacyMediaDOM on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865214 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [22:49:28] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/881702 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [22:50:33] (03CR) 10Cwhite: [C: 03+2] logstash: move blackbox-exporter logs to ecs-probes indexes [puppet] - 10https://gerrit.wikimedia.org/r/881370 (https://phabricator.wikimedia.org/T327308) (owner: 10Cwhite) [23:33:01] (03PS2) 10BCornwall: tlsproxy: Remove ssl_dyn_rec support [puppet] - 10https://gerrit.wikimedia.org/r/881717 (https://phabricator.wikimedia.org/T228730) [23:35:52] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39187/console" [puppet] - 10https://gerrit.wikimedia.org/r/881717 (https://phabricator.wikimedia.org/T228730) (owner: 10BCornwall) [23:37:40] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39188/console" [puppet] - 10https://gerrit.wikimedia.org/r/881717 (https://phabricator.wikimedia.org/T228730) (owner: 10BCornwall) [23:46:30] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:48:04] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops