[00:03:54] yeah, self-recovered and looks stable -- we still need to make adjustments for this (serviceops discussed it some today) but nothing needed in immediate response to the page [00:05:32] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:16] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:12:51] (03PS2) 10Dave Pifke: coal: use Python 3, add cachelib dependency [puppet] - 10https://gerrit.wikimedia.org/r/774512 (https://phabricator.wikimedia.org/T301638) [00:20:08] (03CR) 10Dave Pifke: coal: use Python 3, add cachelib dependency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774512 (https://phabricator.wikimedia.org/T301638) (owner: 10Dave Pifke) [00:21:52] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 87 probes of 672 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:28:08] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 59 probes of 672 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:38:38] 10SRE, 10ops-eqiad: Degraded RAID on thanos-be1003 - https://phabricator.wikimedia.org/T304873 (10wiki_willy) a:03Cmjohnson [00:39:11] 10ops-eqiad, 10decommission-hardware: decommission kubernetes100[1-4] - https://phabricator.wikimedia.org/T303044 (10wiki_willy) a:03Cmjohnson [00:45:28] (03PS1) 10Ladsgroup: dbtools: Add master_finder.py [software] - 10https://gerrit.wikimedia.org/r/774585 (https://phabricator.wikimedia.org/T281249) 
[00:45:50] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Ladsgroup) Here you are ^ [01:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220329T0100) [01:35:36] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:21] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:20] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:44:16] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:45:21] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:34] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:46:34] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: 
lists1001 - Icinga CRIT alerts - https://phabricator.wikimedia.org/T304886 (10Dzahn) p:05Triage→03Medium [01:47:33] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: lists1001 - Icinga CRIT alerts - https://phabricator.wikimedia.org/T304886 (10Dzahn) [01:48:40] 10SRE, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3), and 2 others: Unquoted URL parameter - https://phabricator.wikimedia.org/T304323 (10Dzahn) [01:57:40] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:58:58] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:07:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.5 [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/774588 [02:07:29] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.5 [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/774588 (owner: 10TrainBranchBot) [02:07:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:08:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:10:45] (NodeTextfileStale) firing: Stale 
textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:17:28] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv6: Connect - Orange, AS5511/IPv4: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:23:49] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.5 [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/774588 (owner: 10TrainBranchBot) [02:29:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:30:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:15] 10SRE, 10serviceops, 10Patch-For-Review: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (10RLazarus) Another way I'd like to improve this is to deal with Puppet skew on the two hosts. Right now, if a patch changes both appserver behavior (say, Apache config) //and// the httpbb tests, th... 
[02:39:14] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 21, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:57:28] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:10:45] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:42:38] 10SRE, 10Thumbor, 10Traffic, 10affects-Kiwix-and-openZIM: MWoffliner scrapes slowed down by Thumbor failure throttling 429s - https://phabricator.wikimedia.org/T304814 (10Kelson) > Of the 4 Thumbor throttles, only 1 is per-IP address. The other three are based on the original file (failure or concurrency)... [03:58:38] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:34:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300775)', diff saved to https://phabricator.wikimedia.org/P23436 and previous config saved to /var/cache/conftool/dbconfig/20220329-043428-marostegui.json [04:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:35] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [04:49:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P23437 and previous config saved to /var/cache/conftool/dbconfig/20220329-044933-marostegui.json [04:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:02] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 20 hosts with reason: Primary switchover s3 T301850 [05:02:06] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:07] T301850: Switchover s3 master (db1157 -> db1123) - https://phabricator.wikimedia.org/T301850 [05:02:16] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 20 hosts with reason: Primary switchover s3 T301850 [05:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1123 with weight 0 T301850', diff saved to https://phabricator.wikimedia.org/P23438 and previous config saved to /var/cache/conftool/dbconfig/20220329-050234-root.json [05:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:12] (03PS2) 10Marostegui: mariadb: Promote db1123 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/774373 (https://phabricator.wikimedia.org/T301850) [05:04:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P23439 and previous config saved to /var/cache/conftool/dbconfig/20220329-050438-marostegui.json [05:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:53] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/774769 [05:16:38] Database lag? [05:17:10] " [05:17:10] Changes newer than 18 seconds may not be shown in this list. [05:17:10] " [05:17:12] oops [05:17:21] HexChat didn't have a newline thing [05:17:53] Bsadowski1: on which wiki? 
we are about to do some maintenance on s3, so you might experience some lag now [05:19:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300775)', diff saved to https://phabricator.wikimedia.org/P23440 and previous config saved to /var/cache/conftool/dbconfig/20220329-051943-marostegui.json [05:19:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1134.eqiad.wmnet with reason: Maintenance [05:19:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1134.eqiad.wmnet with reason: Maintenance [05:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:50] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [05:19:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T300775)', diff saved to https://phabricator.wikimedia.org/P23441 and previous config saved to /var/cache/conftool/dbconfig/20220329-051951-marostegui.json [05:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:03] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1123 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/774373 (https://phabricator.wikimedia.org/T301850) (owner: 10Marostegui) [05:22:46] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:23:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [05:23:26] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [05:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298556)', diff saved to https://phabricator.wikimedia.org/P23442 and previous config saved to /var/cache/conftool/dbconfig/20220329-052331-marostegui.json [05:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:36] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [05:24:52] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:32:58] (03PS1) 10Hashar: ci: relocate castor storage directory [puppet] - 10https://gerrit.wikimedia.org/r/774771 [05:35:58] (03PS2) 10Hashar: ci: relocate castor storage directory [puppet] - 10https://gerrit.wikimedia.org/r/774771 (https://phabricator.wikimedia.org/T252071) [05:40:51] (03CR) 10Hashar: [C: 03+1] "PS2 fixed some puppet-lint arrows alignment issue. I have cherry picked it on the integration puppet master and it passes on both hosts." [puppet] - 10https://gerrit.wikimedia.org/r/774525 (https://phabricator.wikimedia.org/T252071) (owner: 10Hashar) [05:41:19] (03CR) 10Hashar: [C: 03+1] "Cherry picked on the integration puppet master and I have moved the directory." 
[puppet] - 10https://gerrit.wikimedia.org/r/774771 (https://phabricator.wikimedia.org/T252071) (owner: 10Hashar) [05:41:53] (03PS1) 10Ladsgroup: Fix user_email_token_expires size [software/schema-changes] - 10https://gerrit.wikimedia.org/r/774772 (https://phabricator.wikimedia.org/T298565) [05:42:01] (03CR) 10jerkins-bot: [V: 04-1] Fix user_email_token_expires size [software/schema-changes] - 10https://gerrit.wikimedia.org/r/774772 (https://phabricator.wikimedia.org/T298565) (owner: 10Ladsgroup) [05:43:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298556)', diff saved to https://phabricator.wikimedia.org/P23443 and previous config saved to /var/cache/conftool/dbconfig/20220329-054357-marostegui.json [05:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:03] (03PS2) 10Ladsgroup: Fix user_email_token_expires size [software/schema-changes] - 10https://gerrit.wikimedia.org/r/774772 (https://phabricator.wikimedia.org/T298565) [05:44:03] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [05:45:45] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:48:12] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:52:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [05:52:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [05:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:51] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23444 and previous config saved to /var/cache/conftool/dbconfig/20220329-055251-ladsgroup.json [05:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:57] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:54:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23445 and previous config saved to /var/cache/conftool/dbconfig/20220329-055458-ladsgroup.json [05:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [05:55:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [05:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23446 and previous config saved to /var/cache/conftool/dbconfig/20220329-055544-ladsgroup.json [05:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:10] (03CR) 10Ladsgroup: [C: 03+2] Fix user_email_token_expires size [software/schema-changes] - 10https://gerrit.wikimedia.org/r/774772 
(https://phabricator.wikimedia.org/T298565) (owner: 10Ladsgroup) [05:56:38] (03Merged) 10jenkins-bot: Fix user_email_token_expires size [software/schema-changes] - 10https://gerrit.wikimedia.org/r/774772 (https://phabricator.wikimedia.org/T298565) (owner: 10Ladsgroup) [05:59:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P23447 and previous config saved to /var/cache/conftool/dbconfig/20220329-055902-marostegui.json [05:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:05] kormat, marostegui, and Amir1: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220329T0600). [06:00:08] !log Starting s3 eqiad failover from db1157 to db1123 - T301850 [06:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:13] T301850: Switchover s3 master (db1157 -> db1123) - https://phabricator.wikimedia.org/T301850 [06:00:19] o/ [06:00:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T301850', diff saved to https://phabricator.wikimedia.org/P23448 and previous config saved to /var/cache/conftool/dbconfig/20220329-060024-marostegui.json [06:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1123 to s3 primary and set section read-write T301850', diff saved to https://phabricator.wikimedia.org/P23449 and previous config saved to /var/cache/conftool/dbconfig/20220329-060059-marostegui.json [06:01:02] all done [06:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:19] \o/ [06:01:32] do you have a time for it? 
how long the read-only took [06:01:59] yes, I will post in a sec in the task [06:02:00] still doing post-switchover tasks [06:02:10] awesome [06:02:11] Thanks [06:03:36] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s3 master CNAME [dns] - 10https://gerrit.wikimedia.org/r/774374 (https://phabricator.wikimedia.org/T301850) (owner: 10Marostegui) [06:05:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1157 T301850', diff saved to https://phabricator.wikimedia.org/P23450 and previous config saved to /var/cache/conftool/dbconfig/20220329-060532-root.json [06:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:38] T301850: Switchover s3 master (db1157 -> db1123) - https://phabricator.wikimedia.org/T301850 [06:09:09] (03PS1) 10Marostegui: db1157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/774815 [06:09:50] (03CR) 10Marostegui: [C: 03+2] db1157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/774815 (owner: 10Marostegui) [06:10:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23451 and previous config saved to /var/cache/conftool/dbconfig/20220329-061004-ladsgroup.json [06:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:11:29] !log Maintenance on db1157 (old s3 master) T301848 [06:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:34] T301848: Check for compressed templatelinks tables - https://phabricator.wikimedia.org/T301848 [06:14:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff 
saved to https://phabricator.wikimedia.org/P23452 and previous config saved to /var/cache/conftool/dbconfig/20220329-061407-marostegui.json [06:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:33] !log dbmaint s3@eqiad T300381 [06:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:38] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [06:25:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23453 and previous config saved to /var/cache/conftool/dbconfig/20220329-062508-ladsgroup.json [06:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:53] !log dbmaint s3@eqiad T300775 [06:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:57] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [06:26:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23454 and previous config saved to /var/cache/conftool/dbconfig/20220329-062625-ladsgroup.json [06:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:30] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:29:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298556)', diff saved to https://phabricator.wikimedia.org/P23455 and previous config saved to /var/cache/conftool/dbconfig/20220329-062912-marostegui.json [06:29:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:18] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [06:33:33] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - No response from remote host 91.198.174.244 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:35:09] Are pagers going off right now? Seeing 503 on enWS… [06:35:24] No [06:35:27] And Commons. [06:35:38] Yeah enwp down here [06:35:40] #page [06:35:41] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:50] Request from - via cp3052.esams.wmnet, ATS/8.0.8 [06:35:50] Error: 502, Next Hop Connection Failed at 2022-03-29 06:35:28 GMT [06:35:54] it works for me [06:36:09] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:36:14] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 233 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Phabricator [06:36:15] PROBLEM - PyBal backends health check on lvs3007 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3050.esams.wmnet, cp3054.esams.wmnet, cp3058.esams.wmnet, cp3062.esams.wmnet, cp3064.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: testlb6_443: Serve [06:36:16] 0.esams.wmnet, cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3052.esams.wmnet, cp3064.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3064.esams.wmnet, 
cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:36:19] Still nothing here marostegui [06:36:21] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:36:23] PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled [06:36:23] 6_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:36:23] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:36:26] And now icinga-wm pages [06:36:29] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print [06:36:29] page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): 
/{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Proton [06:36:31] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:36:33] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:36:33] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:36:41] PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:36:42] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:36:42] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:36:45] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:36:45] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:36:56] PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:37:20] PROBLEM - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:37:21] PROBLEM - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:37:23] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [06:37:25] PROBLEM - Debmonitor Health Check on debmonitor.wikimedia.org is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/Debmonitor [06:37:39] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:37:41] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:37:45] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:37:51] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:37:51] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:37:51] PROBLEM - Varnish HTTP text-frontend 
- port 3126 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:37:57] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 233 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:38:01] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:01] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:01] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:17] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:24] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=13 [06:38:25] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:25] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:25] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:27] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 
is CRITICAL: 15.38 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [06:38:29] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:29] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:31] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 18.17 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [06:38:33] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:33] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:33] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:33] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:35] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:36] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:37] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:37] PROBLEM - 
Varnish HTTP text-frontend - port 3127 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:37] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:41] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:43] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:43] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:43] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:43] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:55] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:56] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:56] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:38:56] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:39:11] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:39:13] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:39:13] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:39:15] "Request from - via cp3050.esams.wmnet, ATS/8.0.8 [06:39:15] Error: 502, Next Hop Connection Failed at 2022-03-29 06:36:12 GMT" [06:39:22] ShakespeareFan00: known [06:39:25] RECOVERY - PyBal backends health check on lvs3007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:39:26] Thanks. [06:39:35] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:39:36] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:39:36] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:39:36] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:39:36] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:39:44] RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19008 bytes in 0.541 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:39:44] PROBLEM - Time elapsed since the last kafka event processed by purged on 
cp5016 is CRITICAL: 3.629e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5016 [06:39:50] (03CR) 10Phedenskog: grafana: provision JSON datasource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [06:39:53] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:39:53] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:39:55] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:40:01] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:40:03] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.168 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:40:05] Now responding to interactive requests again. 
[06:40:07] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5009 is CRITICAL: 3.925e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [06:40:09] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:40:09] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:40:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23456 and previous config saved to /var/cache/conftool/dbconfig/20220329-064013-ladsgroup.json [06:40:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [06:40:15] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: testlb6_443: Serve [06:40:15] 1.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:40:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance 
[06:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:19] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 3.809e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [06:40:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:40:21] RECOVERY - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 623 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:40:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23457 and previous config saved to /var/cache/conftool/dbconfig/20220329-064021-ladsgroup.json [06:40:22] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:26] <_joe_> !log restarting varnish text-fe on cp1079 [06:40:27] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5011 is CRITICAL: 4.435e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [06:40:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:31] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5005 is CRITICAL: 4.121e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005 [06:40:33] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5006 is CRITICAL: 4.423e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [06:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:37] RECOVERY - Debmonitor Health Check on debmonitor.wikimedia.org is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 1634 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [06:40:39] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [06:40:43] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5012 is CRITICAL: 4.117e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [06:40:43] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5013 is CRITICAL: 4.095e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013 [06:40:45] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5008 is CRITICAL: 4.568e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [06:40:53] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5001 is CRITICAL: 
4.182e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001 [06:40:59] RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 39622 bytes in 0.129 second response time https://wikitech.wikimedia.org/wiki/Phabricator [06:41:06] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:41:07] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:41:16] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19007 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:41:19] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:41:19] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:41:21] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [06:41:23] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5002 is CRITICAL: 4.789e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002 [06:41:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to 
https://phabricator.wikimedia.org/P23458 and previous config saved to /var/cache/conftool/dbconfig/20220329-064130-ladsgroup.json [06:41:31] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:51] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 94.19 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [06:41:53] (JobUnavailable) firing: (6) Reduced availability for job pdu_sentry4 in eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:41:59] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:41:59] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:42:03] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:42:03] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:42:03] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:42:04] (03PS2) 10Phedenskog: grafana: provision JSON datasource [puppet] - 
10https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) [06:42:07] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3052 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:42:07] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:42:07] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3056 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:42:11] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 6.401 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:42:13] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:42:23] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:42:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23459 and previous config saved to /var/cache/conftool/dbconfig/20220329-064229-ladsgroup.json [06:42:31] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:35] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1079.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1083.eqiad.wmnet, cp1079.eqiad.wmnet are marked down but pooled: 
testlb6_443: Servers cp1083.eqiad.wmnet, cp1079.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1083.eqiad.wmnet are marked down but pooled https://wikitech.wikime [06:42:36] wiki/PyBal [06:42:41] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:43:13] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:43:19] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5014 is CRITICAL: 5.728e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014 [06:43:19] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5015 is CRITICAL: 5.718e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [06:43:21] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:43:44] RECOVERY - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 610 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:43:44] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:44:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [06:44:49] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:44:51] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:45:23] PROBLEM - LVS text eqsin port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:45:29] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:45:51] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:45:53] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 9.313 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:46:05] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 1.267 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:46:13] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.530 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:46:53] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 1.271 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:47:04] RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19008 
bytes in 7.934 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:47:13] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [06:47:16] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down bu [06:47:16] : textlb6_443: Servers cp5009.eqsin.wmnet, cp5008.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:47:31] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5016 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 1.267 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:47:47] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.537 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:47:48] PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:48:35] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 1.567 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:48:35] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.545 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:48:53] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5015 is OK: 
HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.530 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:48:56] RECOVERY - LVS text eqsin port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 610 bytes in 0.532 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:48:59] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.529 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:49:07] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:49:11] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.532 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:49:11] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.548 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:49:27] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.545 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:49:32] RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18994 bytes in 1.707 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:49:35] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5011 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.513 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:49:36] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.516 second response time https://wikitech.wikimedia.org/wiki/Varnish 
[06:49:47] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.486 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:49:53] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.459 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:49:56] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:49:56] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.471 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:50:05] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:50:09] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:50:09] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:50:09] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.467 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:50:11] RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:50:21] (JobUnavailable) firing: (6) Reduced availability for job pdu_sentry4 in eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:50:29] RECOVERY - Varnish HTTP 
text-frontend - port 3127 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.473 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:50:31] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5002 is OK: (C)5000 gt (W)3000 gt 544.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002 [06:50:31] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5016 is OK: (C)5000 gt (W)3000 gt 963.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5016 [06:50:53] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.455 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:50:53] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:50:53] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.459 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:50:53] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.455 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:50:53] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5015 is OK: (C)5000 gt (W)3000 gt 944 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [06:50:54] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5014 is OK: (C)5000 gt (W)3000 gt 1204 https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014 [06:53:17] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=13 [06:53:28] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.466 second response time https://wikitech.wikimedia.org/wiki/Varnish [06:56:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P23460 and previous config saved to /var/cache/conftool/dbconfig/20220329-065635-ladsgroup.json [06:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23461 and previous config saved to /var/cache/conftool/dbconfig/20220329-065734-ladsgroup.json [06:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:44] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 101.2 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [07:00:04] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5001 is OK: (C)5000 gt (W)3000 gt 652.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001 [07:00:04] Amir1, awight, Urbanecm, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220329T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:59] * kart_ is here [07:01:10] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:01:32] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:02:20] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5011 is OK: (C)5000 gt (W)3000 gt 352.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [07:02:45] (03PS6) 10KartikMistry: Add viwiki eliminators to wgContentTranslationPublishRequirements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774386 (https://phabricator.wikimedia.org/T299636) (owner: 10NguoiDungKhongDinhDanh) [07:02:58] good morning [07:03:18] let me check a few things before deploying [07:03:42] (03PS1) 10Ladsgroup: Set write both for all wikis except s1 and s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774816 (https://phabricator.wikimedia.org/T299421) [07:04:26] (03CR) 10Ladsgroup: [C: 04-1] "s3 is not done yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774816 (https://phabricator.wikimedia.org/T299421) (owner: 10Ladsgroup) [07:04:34] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 382.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [07:06:50] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5007 
is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:08:02] kart_: hey! there was just a now-fixed issue with the app servers, I was told to wait a few minutes before deploying the config changes just in case [07:08:57] taavi: Sure. Let's wait. [07:09:04] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.448 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:09:04] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5013 is OK: (C)5000 gt (W)3000 gt 333.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013 [07:09:12] PROBLEM - Check systemd state on ms-be2050 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:10:45] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:11:14] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:11:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23462 and previous config saved to /var/cache/conftool/dbconfig/20220329-071140-ladsgroup.json [07:11:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [07:11:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [07:11:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:11:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23463 and previous config saved to /var/cache/conftool/dbconfig/20220329-071148-ladsgroup.json [07:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:00] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2050 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:12:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23464 and previous config saved to /var/cache/conftool/dbconfig/20220329-071239-ladsgroup.json [07:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:28] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5005 is OK: (C)5000 gt (W)3000 gt 362.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005 [07:13:28] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5008 is OK: (C)5000 gt (W)3000 gt 254.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [07:14:24] kart_: ok, I think we're in the clear now [07:14:45] do you want to self-service or do you want me to deploy it? [07:15:24] taavi: I can deploy :) [07:15:46] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.444 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:15:46] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5012 is OK: (C)5000 gt (W)3000 gt 216.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [07:16:12] sure, you have the only scheduled patch so feel free to just do it [07:16:46] (03CR) 10KartikMistry: [C: 03+2] Add viwiki eliminators to wgContentTranslationPublishRequirements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774386 (https://phabricator.wikimedia.org/T299636) (owner: 10NguoiDungKhongDinhDanh) [07:17:56] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:17:56] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5006 is OK: (C)5000 gt (W)3000 gt 283.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [07:18:37] (03Merged) 10jenkins-bot: Add viwiki eliminators to wgContentTranslationPublishRequirements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774386 (https://phabricator.wikimedia.org/T299636) (owner: 10NguoiDungKhongDinhDanh) [07:20:10] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:21:43] (03PS1) 10MVernon: 
swift::ring_manager - install_dir is managed by git::clone [puppet] - 10https://gerrit.wikimedia.org/r/774818 (https://phabricator.wikimedia.org/T265117) [07:22:24] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.465 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:22:24] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5009 is OK: (C)5000 gt (W)3000 gt 280.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [07:23:13] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/774818 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [07:23:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:23:49] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:774386|Add viwiki eliminators to wgContentTranslationPublishRequirements (T299636)]] (duration: 00m 50s) [07:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:57] T299636: Disable ContentTranslation for non-extended confirmed users on viwiki - https://phabricator.wikimedia.org/T299636 [07:24:09] taavi: I'm done. [07:24:13] thanks! [07:24:26] !log UTC morning deploys done [07:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:31] Should I log? 
[07:24:34] Oh thanks :) [07:26:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23465 and previous config saved to /var/cache/conftool/dbconfig/20220329-072601-ladsgroup.json [07:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:26:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:26:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23466 and previous config saved to /var/cache/conftool/dbconfig/20220329-072744-ladsgroup.json [07:27:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:27:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [07:27:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [07:27:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23467 and previous config saved to /var/cache/conftool/dbconfig/20220329-072756-ladsgroup.json [07:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23468 and previous config saved to /var/cache/conftool/dbconfig/20220329-073004-ladsgroup.json [07:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (10ayounsi) @Jclark-ctr can you sync up with me over IRC so I can give you the Junos image and config to put on a USB drive? And please remove the cable fr... 
[07:31:29] (03PS1) 10Marostegui: switchover-tmpl.sh: Change maintenance hour [software] - 10https://gerrit.wikimedia.org/r/774819 (https://phabricator.wikimedia.org/T303605) [07:32:13] (03CR) 10Marostegui: [C: 03+2] switchover-tmpl.sh: Change maintenance hour [software] - 10https://gerrit.wikimedia.org/r/774819 (https://phabricator.wikimedia.org/T303605) (owner: 10Marostegui) [07:34:19] !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host hppxetest2001.codfw.wmnet [07:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host hppxetest2001.codfw.wmnet [07:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:14] RECOVERY - Check systemd state on ms-be2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:36:25] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host hppxetest2001.codfw.wmnet with OS bullseye [07:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P23469 and previous config saved to /var/cache/conftool/dbconfig/20220329-073703-root.json [07:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:45] !log dbmaint s6@eqiad T297189 [07:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:50] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [07:41:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P23470 and previous config saved to /var/cache/conftool/dbconfig/20220329-074106-ladsgroup.json [07:41:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:04] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2050 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:43:51] (03PS1) 10Majavah: httpbb: fix status code checks for CodeReview redirects [puppet] - 10https://gerrit.wikimedia.org/r/774821 (https://phabricator.wikimedia.org/T205361) [07:45:04] (03PS2) 10MVernon: swift::ring_manager - install_dir is managed by git::clone [puppet] - 10https://gerrit.wikimedia.org/r/774818 (https://phabricator.wikimedia.org/T265117) [07:45:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23471 and previous config saved to /var/cache/conftool/dbconfig/20220329-074509-ladsgroup.json [07:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:52] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/774818 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [07:46:10] hello, I will start running the train for 1.39.0-wmf.5 in 15 minutes [07:48:34] (03CR) 10Majavah: "thanks for deploying this! the /wiki/Special:CodeReview/... aliases work fine, but for whatever reason /w/index.php?title=CodeReview/... 
s" [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [07:48:46] !log dbmaint s3@eqiad T298554 [07:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:52] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [07:50:12] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:51:51] (03CR) 10Filippo Giunchedi: [C: 03+2] lists: remove double quoting for http check [puppet] - 10https://gerrit.wikimedia.org/r/774408 (https://phabricator.wikimedia.org/T304323) (owner: 10Filippo Giunchedi) [07:52:15] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: remove double quoting for gerrit health check [puppet] - 10https://gerrit.wikimedia.org/r/774407 (https://phabricator.wikimedia.org/T304323) (owner: 10Filippo Giunchedi) [07:53:10] (03PS3) 10Filippo Giunchedi: lists: remove double quoting for http check [puppet] - 10https://gerrit.wikimedia.org/r/774408 (https://phabricator.wikimedia.org/T304323) [07:56:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P23472 and previous config saved to /var/cache/conftool/dbconfig/20220329-075611-ladsgroup.json [07:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:01] (03PS2) 10Filippo Giunchedi: icinga: remove double quoting for gerrit health check [puppet] - 10https://gerrit.wikimedia.org/r/774407 (https://phabricator.wikimedia.org/T304323) [08:00:05] hashar and jeena: #bothumor I οΏ½ Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220329T0800). 
[08:00:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23473 and previous config saved to /var/cache/conftool/dbconfig/20220329-080014-ladsgroup.json [08:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:10] !log dbmaint s3@eqiad T298563 [08:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:14] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [08:02:40] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host hppxetest2001.codfw.wmnet with OS bullseye [08:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:05:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:26] (03PS1) 10Hashar: testwikis wikis to 1.39.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774822 [08:06:28] (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.39.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774822 (owner: 10Hashar) [08:07:04] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774822 (owner: 10Hashar) [08:07:07] !log 
hashar@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.5 [08:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:17] (03CR) 10MVernon: "This is one of the puppetboard errors." [puppet] - 10https://gerrit.wikimedia.org/r/774818 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:11:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23474 and previous config saved to /var/cache/conftool/dbconfig/20220329-081116-ladsgroup.json [08:11:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:11:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:11:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23475 and previous config saved to /var/cache/conftool/dbconfig/20220329-081124-ladsgroup.json [08:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] 
DONE helmfile.d/services/mwdebug: apply [08:12:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:12:57] (03PS11) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for I9b40319d374143668a2666b42f59a3799d041afc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) [08:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:36] (03PS1) 10MVernon: hiera: put thanos::swift::cluster back for backends [puppet] - 10https://gerrit.wikimedia.org/r/774823 (https://phabricator.wikimedia.org/T265117) [08:14:47] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/774823 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:15:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23476 and previous config saved to /var/cache/conftool/dbconfig/20220329-081519-ladsgroup.json [08:15:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [08:15:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [08:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23477 and previous config saved to 
/var/cache/conftool/dbconfig/20220329-081527-ladsgroup.json [08:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23478 and previous config saved to /var/cache/conftool/dbconfig/20220329-081735-ladsgroup.json [08:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:19:24] (03CR) 10Filippo Giunchedi: [C: 03+1] swift::ring_manager - install_dir is managed by git::clone [puppet] - 10https://gerrit.wikimedia.org/r/774818 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:20:48] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: put thanos::swift::cluster back for backends [puppet] - 10https://gerrit.wikimedia.org/r/774823 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:21:27] (03CR) 10MVernon: [C: 03+2] swift::ring_manager - install_dir is managed by git::clone [puppet] - 10https://gerrit.wikimedia.org/r/774818 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:21:41] (03CR) 10MVernon: [C: 03+2] hiera: put thanos::swift::cluster back for backends [puppet] - 10https://gerrit.wikimedia.org/r/774823 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:23:39] (03CR) 10David Caro: [C: 03+2] paws: add haproxy routing for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/774382 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [08:23:55] (03PS1) 10Hashar: scap: make rsync use new compress 
algorithm [puppet] - 10https://gerrit.wikimedia.org/r/774824 (https://phabricator.wikimedia.org/T252540) [08:26:41] (03CR) 10MVernon: swift: deploy swift_ring_manager to one node per cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [08:27:10] (03CR) 10Hashar: "The change to scap was https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/595942 I can't tell whether it makes a difference though." [puppet] - 10https://gerrit.wikimedia.org/r/774824 (https://phabricator.wikimedia.org/T252540) (owner: 10Hashar) [08:32:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23479 and previous config saved to /var/cache/conftool/dbconfig/20220329-083240-ladsgroup.json [08:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:24] (03PS1) 10Filippo Giunchedi: Restore check_https_url command for api/appservers [puppet] - 10https://gerrit.wikimedia.org/r/774825 (https://phabricator.wikimedia.org/T304237) [08:41:07] !log dbmaint s3@eqiad T298557 [08:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:12] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [08:43:04] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005867 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:43:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23480 and previous config saved to 
/var/cache/conftool/dbconfig/20220329-084745-ladsgroup.json [08:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:00] (03PS1) 10Lucas Werkmeister (WMDE): Use "unexpectedUnconnectedPage" page prop on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774847 [08:50:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:50:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:50] (03CR) 10David Caro: "A request, and a question, all the nits can be ignored" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774459 (owner: 10Arturo Borrero Gonzalez) [08:53:47] (03CR) 10David Caro: wmcs: toolforge: introduce cookbook to build/deploy all k8s components (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773612 (owner: 10Arturo Borrero Gonzalez) [08:55:42] (03CR) 10David Caro: wmcs: toolforge: k8s: factorize build/deplo code into a manager class (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774459 (owner: 10Arturo Borrero Gonzalez) [08:56:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:32] (03Abandoned) 10Alexandros Kosiaris: (WIP): Unify kubernetes users to automate user creation [labs/private] - 10https://gerrit.wikimedia.org/r/715745 (owner: 10Alexandros Kosiaris) [08:57:29] (03PS36) 10Elukey: Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [08:58:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23481 and previous 
config saved to /var/cache/conftool/dbconfig/20220329-085819-ladsgroup.json [08:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:24] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:00:12] (03CR) 10Elukey: [C: 03+2] Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [09:00:57] (03PS1) 10Giuseppe Lavagetto: requestctl: fix interactive mode, sync [software/conftool] - 10https://gerrit.wikimedia.org/r/774849 [09:00:59] (03PS1) 10Giuseppe Lavagetto: requestctl: better error messages for inexistent references [software/conftool] - 10https://gerrit.wikimedia.org/r/774850 [09:01:01] (03PS1) 10Giuseppe Lavagetto: requestctl: force yaml rendering for actions [software/conftool] - 10https://gerrit.wikimedia.org/r/774851 [09:01:03] (03PS1) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/774852 [09:02:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23482 and previous config saved to /var/cache/conftool/dbconfig/20220329-090250-ladsgroup.json [09:02:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:02:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: 
Maintenance [09:02:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23483 and previous config saved to /var/cache/conftool/dbconfig/20220329-090303-ladsgroup.json [09:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:03] (03CR) 10Volans: [C: 03+1] "LGTM, makse sense" [software/conftool] - 10https://gerrit.wikimedia.org/r/774849 (owner: 10Giuseppe Lavagetto) [09:04:32] (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/774850 (owner: 10Giuseppe Lavagetto) [09:04:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47822 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:05:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23484 and previous config saved to /var/cache/conftool/dbconfig/20220329-090510-ladsgroup.json [09:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:05:42] (03PS1) 
10Majavah: paws: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/774853 [09:07:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.312 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:07:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:07:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:07:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T297189)', diff saved to https://phabricator.wikimedia.org/P23485 and previous config saved to /var/cache/conftool/dbconfig/20220329-090737-marostegui.json [09:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:47] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [09:08:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P23486 and previous config saved to /var/cache/conftool/dbconfig/20220329-090759-root.json [09:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:14] (03PS3) 10Alexandros Kosiaris: 
httpd: Globally enable wmfjson [puppet] - 10https://gerrit.wikimedia.org/r/572702 [09:10:52] (03CR) 10Volans: [C: 03+1] "LGTM, optional nit inline" [software/conftool] - 10https://gerrit.wikimedia.org/r/774851 (owner: 10Giuseppe Lavagetto) [09:11:15] !log dbmaint s3@eqiad T298294 [09:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:20] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [09:11:22] (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/774852 (owner: 10Giuseppe Lavagetto) [09:11:51] (03Abandoned) 10Kosta Harlan: GLAM events: add topic match mode widget selector [extensions/GrowthExperiments] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/774405 (https://phabricator.wikimedia.org/T301825) (owner: 10Kosta Harlan) [09:12:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] "I had a quick look through all appservers, they all have more than enough disk space for this. The only ones with a bit higher disk usage " [puppet] - 10https://gerrit.wikimedia.org/r/572702 (owner: 10Alexandros Kosiaris) [09:13:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P23487 and previous config saved to /var/cache/conftool/dbconfig/20220329-091324-ladsgroup.json [09:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:11] (03PS5) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [09:16:25] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Set MW_USE_CONFIG_SCHEMA constant if file exists. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/772937 (https://phabricator.wikimedia.org/T304460) (owner: 10Daniel Kinzler) [09:18:11] (03PS42) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [09:20:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23488 and previous config saved to /var/cache/conftool/dbconfig/20220329-092016-ladsgroup.json [09:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:26] (03PS1) 10David Caro: wmcs-backups: exclude integration-castor04, that vm has no disk image [puppet] - 10https://gerrit.wikimedia.org/r/774854 [09:23:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P23489 and previous config saved to /var/cache/conftool/dbconfig/20220329-092303-root.json [09:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1157.eqiad.wmnet with OS bullseye [09:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:24] !log hashar@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.5 (duration: 77m 17s) [09:24:26] php fpm restarting [09:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:29] done! 
:] [09:25:58] (03PS2) 10David Caro: wmcs-backups: exclude integration-castor04, that vm has no disk image [puppet] - 10https://gerrit.wikimedia.org/r/774854 (https://phabricator.wikimedia.org/T304916) [09:27:56] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34603/console" [puppet] - 10https://gerrit.wikimedia.org/r/774543 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [09:28:21] !log hashar@deploy1002 Pruned MediaWiki: 1.39.0-wmf.1 (duration: 03m 49s) [09:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P23490 and previous config saved to /var/cache/conftool/dbconfig/20220329-092829-ladsgroup.json [09:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:42] (03PS1) 10Hashar: group0 wikis to 1.39.0-wmf.5 refs T300204 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774856 [09:28:44] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.39.0-wmf.5 refs T300204 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774856 (owner: 10Hashar) [09:29:00] 10SRE, 10Data-Persistence-Backup, 10media-backups, 10Goal, 10Patch-For-Review: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (10jcrespo) The main issue I ran into is that it was said it was guaranteed by Mediawiki that no file with the sa... 
[09:29:24] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.5 refs T300204 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774856 (owner: 10Hashar) [09:31:00] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.5 refs T300204 [09:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:05] T300204: 1.39.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T300204 [09:31:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23491 and previous config saved to /var/cache/conftool/dbconfig/20220329-093521-ladsgroup.json [09:35:24] (03PS4) 10Alexandros Kosiaris: Remove LVS for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:37] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye [09:35:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1157.eqiad.wmnet with reason: host reimage [09:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:46] (03CR) 10David Caro: [C: 03+2] "We should have tests for all these... 
maybe in another life" [puppet] - 10https://gerrit.wikimedia.org/r/774853 (owner: 10Majavah) [09:37:29] (03PS5) 10Alexandros Kosiaris: Remove LVS for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:37:40] (03PS6) 10Alexandros Kosiaris: Remove LVS for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:37:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:37:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:37:55] (03CR) 10Daniel Kinzler: "from IRC:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772937 (https://phabricator.wikimedia.org/T304460) (owner: 10Daniel Kinzler) [09:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P23492 and previous config saved to /var/cache/conftool/dbconfig/20220329-093807-root.json [09:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1157.eqiad.wmnet with reason: host reimage [09:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:47] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/774825 (https://phabricator.wikimedia.org/T304237) (owner: 10Filippo Giunchedi) [09:41:49] 1.39.0-wmf.5 looks mostly fine [09:41:50] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: remove conditional, always use gitlab::restore class [puppet] - 10https://gerrit.wikimedia.org/r/774543 (https://phabricator.wikimedia.org/T274463) (owner: 
10Dzahn) [09:42:01] I am having lunch, will continue log triage this afternoon [09:43:12] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34604/console" [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:43:27] !log depool cp2027 for reimage - T290005 [09:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:32] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:43:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23493 and previous config saved to /var/cache/conftool/dbconfig/20220329-094334-ladsgroup.json [09:43:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:43:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:40] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:43:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23494 and previous config saved to /var/cache/conftool/dbconfig/20220329-094342-ladsgroup.json [09:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:48] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10elukey) I tried to PXE boot and it didn't work, so I checked netbox and the interface listed looks weird: xe-4/0/010 https://netbox.wikimedia.org/dcim/... [09:46:51] (03PS3) 10MMandere: site: Reimage cp2027 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773253 (https://phabricator.wikimedia.org/T290005) [09:47:56] (03CR) 10Jelto: "I double checked gitlab1001. Restore still is disabled after this change." [puppet] - 10https://gerrit.wikimedia.org/r/774543 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [09:48:46] (03CR) 10MMandere: [C: 03+2] site: Reimage cp2027 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773253 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [09:48:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:01] (03PS2) 10Jelto: gitlab: add version check to restore script [puppet] - 10https://gerrit.wikimedia.org/r/773783 (https://phabricator.wikimedia.org/T274463) [09:49:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:49:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff 
saved to https://phabricator.wikimedia.org/P23495 and previous config saved to /var/cache/conftool/dbconfig/20220329-095026-ladsgroup.json [09:50:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [09:50:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [09:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:32] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:50:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:50:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [09:50:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [09:50:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:50:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [09:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [09:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [09:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [09:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23496 and previous config saved to /var/cache/conftool/dbconfig/20220329-095103-ladsgroup.json [09:51:09] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS buster [09:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:19] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2027.codfw.wmnet with OS buster [09:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:04] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34605/console" [puppet] - 
10https://gerrit.wikimedia.org/r/773783 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [09:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23497 and previous config saved to /var/cache/conftool/dbconfig/20220329-095310-ladsgroup.json [09:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P23498 and previous config saved to /var/cache/conftool/dbconfig/20220329-095317-root.json [09:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:33] (03PS1) 10Marostegui: Revert "db1157: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/774828 [09:53:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1157.eqiad.wmnet with OS bullseye [09:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:13] (03CR) 10Filippo Giunchedi: [C: 03+2] Restore check_https_url command for api/appservers [puppet] - 10https://gerrit.wikimedia.org/r/774825 (https://phabricator.wikimedia.org/T304237) (owner: 10Filippo Giunchedi) [09:55:56] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10fgiunchedi) [09:56:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10elukey) So I have deleted the /10 interface in netbox, and renamed the /010 to /10. Now homer offers me this diff: ` Changes for 1 devices: ['asw2-c-e... 
[09:58:30] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: add version check to restore script [puppet] - 10https://gerrit.wikimedia.org/r/773783 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [09:58:40] (03CR) 10Marostegui: [C: 03+2] Revert "db1157: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/774828 (owner: 10Marostegui) [09:59:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P23499 and previous config saved to /var/cache/conftool/dbconfig/20220329-095935-root.json [09:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ayounsi) Looks good to deploy! [10:00:59] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [10:01:25] (03PS43) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [10:02:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10elukey) ` elukey@asw2-c-eqiad> show interfaces descriptions xe-4/0/10 Interface Admin Link Description xe-4/0/10 up up ml-cache1002... 
[10:02:31] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-cache1002.eqiad.wmnet with OS bullseye [10:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:46] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye [10:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:52] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-cache1002.eqiad.wmnet with OS bullseye [10:02:54] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34606/console" [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [10:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:48] (03CR) 10Btullis: [C: 03+2] Add an alert for zero messages being generated by varnishkafka instances [alerts] - 10https://gerrit.wikimedia.org/r/773801 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [10:04:50] (03PS1) 10Filippo Giunchedi: sre: add 'prometheus' instance to JobUnavailable [alerts] - 10https://gerrit.wikimedia.org/r/774861 (https://phabricator.wikimedia.org/T304922) [10:04:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23500 and previous config saved to /var/cache/conftool/dbconfig/20220329-100816-ladsgroup.json [10:08:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: fix interactive mode, sync [software/conftool] - 10https://gerrit.wikimedia.org/r/774849 (owner: 10Giuseppe Lavagetto) 
[10:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P23501 and previous config saved to /var/cache/conftool/dbconfig/20220329-100821-root.json [10:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:06] (03Merged) 10jenkins-bot: requestctl: fix interactive mode, sync [software/conftool] - 10https://gerrit.wikimedia.org/r/774849 (owner: 10Giuseppe Lavagetto) [10:10:31] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2027.codfw.wmnet with reason: host reimage [10:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:12:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: better error messages for inexistent references [software/conftool] - 10https://gerrit.wikimedia.org/r/774850 (owner: 10Giuseppe Lavagetto) [10:13:41] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2027.codfw.wmnet with reason: host reimage [10:13:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10elukey) @Cmjohnson I tried to reimage the node but I got `spicerack.dhcp.DHCPError: target file ttyS1-115200/ml-cache1002.conf exists`, I think that yo... 
[10:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:26] (03Merged) 10jenkins-bot: requestctl: better error messages for inexistent references [software/conftool] - 10https://gerrit.wikimedia.org/r/774850 (owner: 10Giuseppe Lavagetto) [10:14:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P23502 and previous config saved to /var/cache/conftool/dbconfig/20220329-101439-root.json [10:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T297189)', diff saved to https://phabricator.wikimedia.org/P23503 and previous config saved to /var/cache/conftool/dbconfig/20220329-101501-marostegui.json [10:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:06] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [10:15:33] (03CR) 10Giuseppe Lavagetto: requestctl: force yaml rendering for actions (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/774851 (owner: 10Giuseppe Lavagetto) [10:17:38] (03PS2) 10Giuseppe Lavagetto: requestctl: force yaml rendering for actions [software/conftool] - 10https://gerrit.wikimedia.org/r/774851 [10:17:41] (03PS2) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/774852 [10:19:22] (03CR) 10jerkins-bot: [V: 04-1] Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/774852 (owner: 10Giuseppe Lavagetto) [10:19:24] (03CR) 10jerkins-bot: [V: 04-1] requestctl: force yaml rendering for actions [software/conftool] - 10https://gerrit.wikimedia.org/r/774851 (owner: 10Giuseppe Lavagetto) [10:23:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to 
https://phabricator.wikimedia.org/P23504 and previous config saved to /var/cache/conftool/dbconfig/20220329-102321-ladsgroup.json [10:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:36] (03PS1) 10Majavah: openstack: fix oauth dict key [puppet] - 10https://gerrit.wikimedia.org/r/774863 (https://phabricator.wikimedia.org/T304918) [10:26:24] (03CR) 10David Caro: [C: 03+2] "Verified by manually changing the code on cloudcontrol1004 and doing a successful login." [puppet] - 10https://gerrit.wikimedia.org/r/774863 (https://phabricator.wikimedia.org/T304918) (owner: 10Majavah) [10:27:25] (03PS3) 10Giuseppe Lavagetto: requestctl: force yaml rendering for actions [software/conftool] - 10https://gerrit.wikimedia.org/r/774851 [10:27:27] (03PS3) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/774852 [10:29:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: force yaml rendering for actions [software/conftool] - 10https://gerrit.wikimedia.org/r/774851 (owner: 10Giuseppe Lavagetto) [10:29:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P23505 and previous config saved to /var/cache/conftool/dbconfig/20220329-102942-root.json [10:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P23506 and previous config saved to /var/cache/conftool/dbconfig/20220329-103006-marostegui.json [10:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:38] (03Merged) 10jenkins-bot: requestctl: force yaml rendering for actions [software/conftool] - 10https://gerrit.wikimedia.org/r/774851 (owner: 10Giuseppe Lavagetto) [10:33:31] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/774377 (https://phabricator.wikimedia.org/T304323) (owner: 10Filippo Giunchedi) [10:33:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/774852 (owner: 10Giuseppe Lavagetto) [10:34:28] (03PS1) 10Elukey: Add istio-cni fake token to k8s configurations [labs/private] - 10https://gerrit.wikimedia.org/r/774865 (https://phabricator.wikimedia.org/T297612) [10:35:35] (03Merged) 10jenkins-bot: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/774852 (owner: 10Giuseppe Lavagetto) [10:35:41] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2027.codfw.wmnet with OS buster [10:35:45] (03CR) 10Elukey: "Adding Janis just to confirm the group "istio" in the infrastructure_users config. Does it need to correspond to something specific (a clu" [labs/private] - 10https://gerrit.wikimedia.org/r/774865 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [10:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23507 and previous config saved to /var/cache/conftool/dbconfig/20220329-103544-ladsgroup.json [10:35:51] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2027.codfw.wmnet with OS buster com... 
[10:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:37:30] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+1] "LGTM, but see inline comment" [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [10:38:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23508 and previous config saved to /var/cache/conftool/dbconfig/20220329-103826-ladsgroup.json [10:38:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:38:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23509 and previous config saved to /var/cache/conftool/dbconfig/20220329-103834-ladsgroup.json [10:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:03] (03CR) 10David Caro: [C: 03+2] P:wmcs::paws::prometheus: fix scrape rules [puppet] - 10https://gerrit.wikimedia.org/r/774516 (owner: 10Majavah) [10:44:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: After reimage', diff saved to 
https://phabricator.wikimedia.org/P23510 and previous config saved to /var/cache/conftool/dbconfig/20220329-104446-root.json [10:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P23511 and previous config saved to /var/cache/conftool/dbconfig/20220329-104511-marostegui.json [10:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:23] (03PS4) 10Jelto: gitlab: move systemd interval for backup and restore to hiera [puppet] - 10https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) [10:46:14] (03CR) 10Volans: [C: 04-1] "I've some concerns for the query execution part, see inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/773670 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [10:47:15] (03PS5) 10Jelto: gitlab: move systemd interval for backup and restore to hiera [puppet] - 10https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) [10:50:45] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:50:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23512 and previous config saved to /var/cache/conftool/dbconfig/20220329-105050-ladsgroup.json [10:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:01] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:51:05] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34607/console" [puppet] - 10https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [10:52:55] (03CR) 10JMeybohm: [C: 04-1] Add istio-cni fake token to k8s configurations (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/774865 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [10:59:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P23513 and previous config saved to /var/cache/conftool/dbconfig/20220329-105950-root.json [10:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T297189)', diff saved to https://phabricator.wikimedia.org/P23514 and previous config saved to /var/cache/conftool/dbconfig/20220329-110016-marostegui.json [11:00:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:00:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:22] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [11:00:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T297189)', diff saved to https://phabricator.wikimedia.org/P23515 and previous config saved to /var/cache/conftool/dbconfig/20220329-110024-marostegui.json [11:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:41] 
RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:05:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23516 and previous config saved to /var/cache/conftool/dbconfig/20220329-110555-ladsgroup.json [11:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:45] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:13:13] 10SRE, 10Growth-Team, 10Notifications, 10Wikimedia-production-error: Failed to fetch API response from {wiki}. Error code {code} - https://phabricator.wikimedia.org/T304927 (10kostajh) [11:13:53] 10SRE, 10Growth-Team, 10Notifications, 10Wikimedia-production-error: Failed to fetch API response from {wiki}. Error code {code} - https://phabricator.wikimedia.org/T304927 (10kostajh) #SRE, feel free to mark this as resolved, just want you all to be aware of that in case there is some more widespread issu... 
[11:14:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P23517 and previous config saved to /var/cache/conftool/dbconfig/20220329-111454-root.json [11:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23518 and previous config saved to /var/cache/conftool/dbconfig/20220329-112101-ladsgroup.json [11:21:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:21:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:21:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23519 and previous config saved to /var/cache/conftool/dbconfig/20220329-112109-ladsgroup.json [11:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:00] !log depool cp2034 for reimage - T290005 [11:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:06] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - 
https://phabricator.wikimedia.org/T290005 [11:27:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300775)', diff saved to https://phabricator.wikimedia.org/P23520 and previous config saved to /var/cache/conftool/dbconfig/20220329-112725-marostegui.json [11:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:32] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [11:27:38] (03PS2) 10Ayounsi: Apply strict uRPF to the cloud-hosts vlan [homer/public] - 10https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) [11:27:50] (03PS2) 10MMandere: site: Reimage cp2034 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773554 (https://phabricator.wikimedia.org/T290005) [11:28:26] (03CR) 10jerkins-bot: [V: 04-1] Apply strict uRPF to the cloud-hosts vlan [homer/public] - 10https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [11:29:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:29:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P23521 and previous config saved to /var/cache/conftool/dbconfig/20220329-112958-root.json [11:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:30:29] (03CR) 10MMandere: [C: 03+2] site: Reimage cp2034 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773554 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:30:46] (03PS3) 10Ayounsi: Apply strict uRPF to the cloud-hosts vlan [homer/public] - 
10https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) [11:31:40] (03Abandoned) 10Ayounsi: Apply strict uRPF to the analytics vlans [homer/public] - 10https://gerrit.wikimedia.org/r/774479 (https://phabricator.wikimedia.org/T298087) (owner: 10Ayounsi) [11:32:00] (03CR) 10jerkins-bot: [V: 04-1] Apply strict uRPF to the cloud-hosts vlan [homer/public] - 10https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [11:32:32] lists.wm.org is down Amir1 legoktm [11:33:02] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp2034.codfw.wmnet with OS buster [11:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:11] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2034.codfw.wmnet with OS buster [11:37:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10cmooney) @Cmjohnson there is an issue with the port assigned for **an-worker1143** on lsw1-e2-eqiad, **an-worker1145** on lsw1-f2-eqiad, and *... 
[11:38:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23522 and previous config saved to /var/cache/conftool/dbconfig/20220329-113849-ladsgroup.json [11:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:39:21] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:40:36] (03PS1) 10Lucas Werkmeister (WMDE): Use null coalescing operator [skins/Timeless] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/774833 (https://phabricator.wikimedia.org/T304917) [11:41:05] I’m going for lunch now, but if anyone else wants to deploy ^ to unblock the train, feel free to [11:41:21] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 25 Jun 2022 07:55:09 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:41:23] otherwise I might do it in the upcoming backport window (a bit over 1 hour from now) [11:41:42] hashar: ^ as train conductor fyi [11:41:53] (03PS2) 10Filippo Giunchedi: logging: bump alerts logs retention [puppet] - 10https://gerrit.wikimedia.org/r/774364 (https://phabricator.wikimedia.org/T304924) [11:42:11] (03CR) 10Filippo Giunchedi: logging: bump alerts logs retention (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/774364 (https://phabricator.wikimedia.org/T304924) (owner: 10Filippo Giunchedi) [11:42:19] RhinosF1: it's up for me [11:42:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P23523 and previous config saved to /var/cache/conftool/dbconfig/20220329-114230-marostegui.json [11:42:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47822 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:35] Amir1: it just recovered [11:42:47] It alerted when I pinged you and was completely down [11:42:47] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (10Zapipedia-WMF) Thank you very much for your help! The pads look perfect, and we are very happy to know that the "restorePad" feature already exists :) [11:43:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:43:39] I am walking back from school :) [11:45:27] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (10MoritzMuehlenhoff) By default only "main" and "thirdparty/hwraid" (for baremetal hosts) are added to our servers. 
And that's by design, so that we have full control what we... [11:48:48] Lucas_WMDE: RhinosF1 I will deploy the patch. Thx for the code [11:49:09] * RhinosF1 didn't do anything [11:49:13] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: quote check_http_url_for_regexp_on_port regex argument [puppet] - 10https://gerrit.wikimedia.org/r/774377 (https://phabricator.wikimedia.org/T304323) (owner: 10Filippo Giunchedi) [11:49:18] (03PS2) 10Filippo Giunchedi: icinga: quote check_http_url_for_regexp_on_port regex argument [puppet] - 10https://gerrit.wikimedia.org/r/774377 (https://phabricator.wikimedia.org/T304323) [11:50:22] (03CR) 10Hashar: [C: 03+2] "Thank you for the patch!" [skins/Timeless] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/774833 (https://phabricator.wikimedia.org/T304917) (owner: 10Lucas Werkmeister (WMDE)) [11:51:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10cmooney) an-worker1142, an-worker1144, an-worker1147 and an-worker1148 should be good to go. I'm not sure why the re-image failed on those tb... 
[11:51:41] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2034.codfw.wmnet with reason: host reimage [11:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:38] (03Merged) 10jenkins-bot: Use null coalescing operator [skins/Timeless] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/774833 (https://phabricator.wikimedia.org/T304917) (owner: 10Lucas Werkmeister (WMDE)) [11:53:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23524 and previous config saved to /var/cache/conftool/dbconfig/20220329-115354-ladsgroup.json [11:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:14] tested on mwdebug [11:56:41] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp2034.codfw.wmnet with reason: host reimage [11:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P23525 and previous config saved to /var/cache/conftool/dbconfig/20220329-115735-marostegui.json [11:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:14] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp2034 is CRITICAL: connect to address 10.192.16.184 and port 3125: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [12:00:21] mm [12:00:38] scap more or less blocked at: 11:56:47 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 347 host(s) [12:01:00] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp2034 is CRITICAL: connect to address 
10.192.16.184 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [12:01:00] PROBLEM - puppet last run on cp2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.184: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:01:09] ^ ignore varnish alert reimaging cp2034 [12:01:48] PROBLEM - Check Varnish UDS /run/varnish-frontend-2.socket on cp2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.184: Connection reset by peer https://wikitech.wikimedia.org/wiki/Varnish [12:02:54] !log hashar@deploy1002 Synchronized php-1.39.0-wmf.5/skins/Timeless/includes/TimelessTemplate.php: Use null coalescing operator - T304917 (duration: 06m 50s) [12:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:59] T304917: PHP Notice: Undefined index: svg - https://phabricator.wikimedia.org/T304917 [12:03:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:03:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:31] (03PS4) 10Ayounsi: Apply strict uRPF to the cloud-hosts vlan [homer/public] - 10https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) [12:05:20] PROBLEM - Host cp2034 is DOWN: PING CRITICAL - Packet loss = 100% [12:07:50] RECOVERY - Host cp2034 is UP: PING OK - Packet loss = 0%, RTA = 31.54 ms [12:08:32] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp2034 is OK: HTTP OK: HTTP/1.1 200 OK - 469 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:08:44] RECOVERY - Check Varnish UDS /run/varnish-frontend-2.socket on cp2034 is OK: OK: varnish UDS working as expected https://wikitech.wikimedia.org/wiki/Varnish [12:08:59] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23526 and previous config saved to /var/cache/conftool/dbconfig/20220329-120859-ladsgroup.json [12:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:16] 10SRE, 10Generated Data Platform, 10Service-deployment-requests: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10WDoranWMF) [12:10:48] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp2034 is OK: HTTP OK: HTTP/1.1 200 OK - 468 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:12:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300775)', diff saved to https://phabricator.wikimedia.org/P23527 and previous config saved to /var/cache/conftool/dbconfig/20220329-121240-marostegui.json [12:12:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:12:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:48] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [12:12:48] RECOVERY - puppet last run on cp2034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:12:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T300775)', diff saved to https://phabricator.wikimedia.org/P23528 and previous config saved to 
/var/cache/conftool/dbconfig/20220329-121248-marostegui.json [12:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:09] 10SRE, 10Infrastructure-Foundations: Many Ganeti hosts have disk space warnings on /boot - https://phabricator.wikimedia.org/T304897 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff Thanks for opening a task, I'll take care of this. [12:16:16] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:17:51] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2034.codfw.wmnet with OS buster [12:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:58] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:18:00] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2034.codfw.wmnet with OS buster com... 
[12:18:26] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Test port-block constraints on QFX5120 devices - https://phabricator.wikimedia.org/T304934 (10cmooney) [12:18:48] (03CR) 10Muehlenhoff: admin: add tsepothoabala to deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [12:18:53] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Test port-block constraints on QFX5120 devices - https://phabricator.wikimedia.org/T304934 (10cmooney) p:05Triage→03Medium [12:22:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23529 and previous config saved to /var/cache/conftool/dbconfig/20220329-122218-ladsgroup.json [12:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:23:20] (03CR) 10Marostegui: [C: 03+1] dbtools: Add master_finder.py [software] - 10https://gerrit.wikimedia.org/r/774585 (https://phabricator.wikimedia.org/T281249) (owner: 10Ladsgroup) [12:23:39] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) Nice!! 
[12:24:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23530 and previous config saved to /var/cache/conftool/dbconfig/20220329-122404-ladsgroup.json [12:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:24:30] (03CR) 10Marostegui: [C: 03+1] "At some point we should add an USAGE to it" [software] - 10https://gerrit.wikimedia.org/r/774585 (https://phabricator.wikimedia.org/T281249) (owner: 10Ladsgroup) [12:24:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23531 and previous config saved to /var/cache/conftool/dbconfig/20220329-122436-ladsgroup.json [12:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:32] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) @Ladsgroup at a second iteration we should add an USAGE to... 
[12:26:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23532 and previous config saved to /var/cache/conftool/dbconfig/20220329-122643-ladsgroup.json [12:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T297189)', diff saved to https://phabricator.wikimedia.org/P23533 and previous config saved to /var/cache/conftool/dbconfig/20220329-122722-marostegui.json [12:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:27] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [12:30:23] (03PS1) 10Volans: sre.hosts.downtime: increase timeout and retries [cookbooks] - 10https://gerrit.wikimedia.org/r/774880 [12:32:08] (03CR) 10Volans: "Increased as we've got few cases already in which it did timeout with:" [cookbooks] - 10https://gerrit.wikimedia.org/r/774880 (owner: 10Volans) [12:35:28] (03PS2) 10Jelto: gitlab: run backup and restore twice daily [puppet] - 10https://gerrit.wikimedia.org/r/774418 (https://phabricator.wikimedia.org/T274463) [12:36:21] (03CR) 10Jelto: [V: 03+1] gitlab: move systemd interval for backup and restore to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [12:37:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P23534 and previous config saved to /var/cache/conftool/dbconfig/20220329-123723-ladsgroup.json [12:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:15] (03PS1) 10Volans: sre.hosts.reimage: fix message for downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/774882 [12:41:49] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23535 and previous config saved to /var/cache/conftool/dbconfig/20220329-124148-ladsgroup.json [12:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P23536 and previous config saved to /var/cache/conftool/dbconfig/20220329-124227-marostegui.json [12:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:56] !log pool cp2034 with HAProxy as TLS termination layer - T290005 [12:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:01] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:48:28] (03CR) 10Klausman: [C: 03+1] profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [12:49:08] (03CR) 10Klausman: [C: 03+1] Add helmfile config for Istio proxy sidecars (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [12:49:52] (03CR) 10Elukey: Add istio-cni fake token to k8s configurations (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/774865 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [12:50:26] (03PS2) 10Elukey: Add istio-cni fake token to k8s configurations [labs/private] - 10https://gerrit.wikimedia.org/r/774865 (https://phabricator.wikimedia.org/T297612) [12:51:39] !log temporarily apply urpf with action: log only, on cr1-eqiad:xe-3/0/4.1118 [12:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1101:3317', diff saved to https://phabricator.wikimedia.org/P23537 and previous config saved to /var/cache/conftool/dbconfig/20220329-125228-ladsgroup.json [12:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23538 and previous config saved to /var/cache/conftool/dbconfig/20220329-125654-ladsgroup.json [12:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P23539 and previous config saved to /var/cache/conftool/dbconfig/20220329-125733-marostegui.json [12:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:13] (03CR) 10Filippo Giunchedi: [C: 03+1] sre.hosts.downtime: increase timeout and retries [cookbooks] - 10https://gerrit.wikimedia.org/r/774880 (owner: 10Volans) [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220329T1300). [13:00:04] Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:20] hello Tchanders [13:00:31] HI [13:00:44] i can deploy today :) [13:00:48] (unless you wish to deploy yourself) [13:01:32] (03PS2) 10Urbanecm: Set IPInfo config for path to MaxMind files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773517 (https://phabricator.wikimedia.org/T304604) (owner: 10Tchanders) [13:01:36] (03CR) 10Urbanecm: [C: 03+2] Set IPInfo config for path to MaxMind files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773517 (https://phabricator.wikimedia.org/T304604) (owner: 10Tchanders) [13:02:08] Urbanecm: Thanks! [13:02:16] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add istio-cni fake token to k8s configurations [labs/private] - 10https://gerrit.wikimedia.org/r/774865 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [13:02:17] (03Merged) 10jenkins-bot: Set IPInfo config for path to MaxMind files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773517 (https://phabricator.wikimedia.org/T304604) (owner: 10Tchanders) [13:02:53] Tchanders: do note that there's also [13:03:02] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:03:12] *that there are also files in `/usr/share/GeoIPInfo` [13:03:18] that have Enterprise in their name [13:03:29] pulled your patch to mwdebug1001 for testing :) [13:03:53] urbanecm: Hmm, Ok let me have a test... 
[13:04:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:16] sure, take your time [13:04:33] (03CR) 10Volans: [C: 03+2] sre.hosts.downtime: increase timeout and retries [cookbooks] - 10https://gerrit.wikimedia.org/r/774880 (owner: 10Volans) [13:05:24] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:05:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:58] (03PS1) 10Jelto: aptrepo::files::updates Update gitlab-ce and gitlab-runner to 14.9 [puppet] - 10https://gerrit.wikimedia.org/r/774884 (https://phabricator.wikimedia.org/T304622) [13:06:16] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/774423 (owner: 10Muehlenhoff) [13:06:32] (03PS6) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [13:06:34] (03PS10) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [13:06:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:06:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:17] (03Merged) 10jenkins-bot: sre.hosts.downtime: increase timeout and retries [cookbooks] - 10https://gerrit.wikimedia.org/r/774880 (owner: 10Volans) [13:07:21] (03CR) 10Filippo Giunchedi: grafana: provision JSON datasource (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [13:07:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23540 and previous config saved to /var/cache/conftool/dbconfig/20220329-130733-ladsgroup.json [13:07:33] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34609/console" [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [13:07:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:07:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:07:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:07:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23541 and previous config saved to /var/cache/conftool/dbconfig/20220329-130741-ladsgroup.json [13:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:10] 
urbanecm: I'm not seeing the data. Have we got time to try the other URL if I make a patch now? [13:08:17] Tchanders: sure thing [13:08:33] (03CR) 10Btullis: [C: 03+1] "Yep, seems fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/773232 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:10:10] !log rollback: temporarily apply urpf with action: log only, on cr1-eqiad:xe-3/0/4.1118 [13:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/774884 (https://phabricator.wikimedia.org/T304622) (owner: 10Jelto) [13:11:20] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Review filtering for cloud-hosts on CR routers eqiad - https://phabricator.wikimedia.org/T285461 (10ayounsi) I pushed the following temporarily and confirmed that no traffic is hitting the filter: `name=cr... [13:11:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23542 and previous config saved to /var/cache/conftool/dbconfig/20220329-131159-ladsgroup.json [13:12:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [13:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [13:12:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [13:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for
12:00:00 on 8 hosts with reason: Maintenance [13:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:17] (03PS1) 10Tchanders: Set IPInfo config for path to MaxMind files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774885 (https://phabricator.wikimedia.org/T304604) [13:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:12:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:12:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:12:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:12:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:12:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T297189)', diff saved to https://phabricator.wikimedia.org/P23543 and previous config saved to /var/cache/conftool/dbconfig/20220329-131238-marostegui.json [13:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:12:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:12:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:12:43] urbanecm: I made a new commit on top - was that right? 
[13:12:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:12:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23544 and previous config saved to /var/cache/conftool/dbconfig/20220329-131246-ladsgroup.json [13:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T297189)', diff saved to https://phabricator.wikimedia.org/P23545 and previous config saved to /var/cache/conftool/dbconfig/20220329-131251-marostegui.json [13:12:51] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [13:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:58] Tchanders: yeah. can you link it to me please? 
[13:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:04] (and put to the calendar, too :)) [13:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:14] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:13:18] urbanecm: sorry, here: https://gerrit.wikimedia.org/r/774885 [13:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:28] thx [13:13:37] (03CR) 10Urbanecm: [C: 03+2] Set IPInfo config for path to MaxMind files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774885 (https://phabricator.wikimedia.org/T304604) (owner: 10Tchanders) [13:14:21] (03Merged) 10jenkins-bot: Set IPInfo config for path to MaxMind files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774885 (https://phabricator.wikimedia.org/T304604) (owner: 10Tchanders) [13:14:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23546 and previous config saved to /var/cache/conftool/dbconfig/20220329-131453-ladsgroup.json [13:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:26] (03PS2) 10Btullis: Use test coordinator for staging datahub deploy [puppet] - 10https://gerrit.wikimedia.org/r/774458 (https://phabricator.wikimedia.org/T301459) [13:15:43] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/774458 
(https://phabricator.wikimedia.org/T301459) (owner: 10Btullis) [13:16:15] Tchanders: pulled to mwdebug1001, can you check? [13:16:26] Having a look [13:17:04] urbanecm: Working, thanks a lot! [13:17:11] syncing both changes [13:18:25] (03CR) 10Btullis: [C: 03+2] Use test coordinator for staging datahub deploy [puppet] - 10https://gerrit.wikimedia.org/r/774458 (https://phabricator.wikimedia.org/T301459) (owner: 10Btullis) [13:18:39] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: d632476: 64226d7: Set IPInfo config for path to MaxMind files (T304604) (duration: 00m 54s) [13:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:45] Tchanders: should be live [13:18:46] T304604: Set config for path to MaxMind files on production - https://phabricator.wikimedia.org/T304604 [13:18:47] anything else? [13:19:05] 10SRE, 10Generated Data Platform, 10Service-deployment-requests: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10WDoranWMF) [13:19:06] urbanecm: Wonderful, thanks! 
Nothing else from me [13:19:12] okay :) [13:19:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:19:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:24] !log UTC afternoon B&C window done [13:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:10] jouncebot: nowandnext [13:23:10] For the next 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220329T1300) [13:23:10] In 2 hour(s) and 36 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220329T1600) [13:23:23] is the floor clean urbanecm ? [13:23:31] Amir1: yes, go ahead [13:23:44] awesome [13:23:52] (03CR) 10Ladsgroup: [C: 03+2] Set write both for all wikis except s1 and s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774816 (https://phabricator.wikimedia.org/T299421) (owner: 10Ladsgroup) [13:23:56] (03PS2) 10Ladsgroup: Set write both for all wikis except s1 and s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774816 (https://phabricator.wikimedia.org/T299421) [13:24:02] (03CR) 10Ladsgroup: [C: 03+2] "..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/774816 (https://phabricator.wikimedia.org/T299421) (owner: 10Ladsgroup) [13:24:26] (03PS5) 10JMeybohm: Ensure the data in kubernetes secrets is ordered by key [deployment-charts] - 10https://gerrit.wikimedia.org/r/774528 [13:24:45] (03Merged) 10jenkins-bot: Set write both for all wikis except s1 and s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774816 (https://phabricator.wikimedia.org/T299421) (owner: 10Ladsgroup) [13:25:06] marostegui: heads up, I'm turning on templatelinks write both on all wikis except s1 and s4 [13:25:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:32] (03PS5) 10Elukey: Add helmfile config for Istio proxy sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) [13:26:34] behold [13:27:22] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:774816|Set write both for all wikis except s1 and s4 (T299421)]] (duration: 00m 55s) [13:27:26] (03CR) 10Elukey: Add helmfile config for Istio proxy sidecars (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [13:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:28] T299421: Turn on write both in production for templatelinks normalization - https://phabricator.wikimedia.org/T299421 [13:28:15] 10SRE, 10SRE Observability, 10Wikimedia-Mailing-lists: lists1001 - Icinga CRIT alerts - https://phabricator.wikimedia.org/T304886 (10lmata) [13:29:07] (03PS5) 10Volans: O:nrpe: add check_http_wmf script [puppet] - 10https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [13:29:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1156', diff saved to https://phabricator.wikimedia.org/P23547 and previous config saved to /var/cache/conftool/dbconfig/20220329-132959-ladsgroup.json [13:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:16] (03CR) 10Ayounsi: "Example diff on cr1-eqiad:" [homer/public] - 10https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [13:30:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:05] 10SRE, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3), 10Sustainability (Incident Followup): Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (10Volans) As John is out I took a stab at the implementation in https://gerrit.wikimedia.or... [13:34:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:34:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:06] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! One small nit inline regarding the schema description but config is good :)" [homer/public] - 10https://gerrit.wikimedia.org/r/702446 (https://phabricator.wikimedia.org/T285461) (owner: 10Cathal Mooney) [13:38:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:21] (03CR) 10JMeybohm: [C: 03+1] "Nice, thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/774861 (https://phabricator.wikimedia.org/T304922) (owner: 10Filippo Giunchedi) [13:39:35] (03CR) 10Volans: "As John is out I took a stab at this, and decided to convert it to python, as I think gives us more flexibility." [puppet] - 10https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [13:42:37] (03CR) 10JMeybohm: profile::calico::kubernetes: add optional istio-cni config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [13:43:34] (03CR) 10Jaime Nuche: [C: 03+1] "What I understand from the message of the first commit for this file, is that the motivation for this script is to ensure that files' meta" [puppet] - 10https://gerrit.wikimedia.org/r/774824 (https://phabricator.wikimedia.org/T252540) (owner: 10Hashar) [13:45:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23548 and previous config saved to /var/cache/conftool/dbconfig/20220329-134504-ladsgroup.json [13:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:51] (03CR) 10Elukey: [C: 03+2] Add helmfile config for Istio proxy sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [13:53:46] 10SRE, 10Generated Data Platform, 10Service-deployment-requests: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10WDoranWMF) [13:55:01] (03CR) 10Elukey: [V: 03+1] profile::calico::kubernetes: add optional istio-cni config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [13:56:02] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: ERROR ferm input drop default 
policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:59:14] PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:50] (03CR) 10Giuseppe Lavagetto: "Simple questions:" [puppet] - 10https://gerrit.wikimedia.org/r/572702 (owner: 10Alexandros Kosiaris) [13:59:54] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:00:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23549 and previous config saved to /var/cache/conftool/dbconfig/20220329-140009-ladsgroup.json [14:00:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:00:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:15] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:00:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23550 and previous config saved to /var/cache/conftool/dbconfig/20220329-140017-ladsgroup.json [14:00:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23551 and previous config saved to /var/cache/conftool/dbconfig/20220329-140224-ladsgroup.json [14:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] httpd: Globally enable wmfjson (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572702 (owner: 10Alexandros Kosiaris) [14:03:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23552 and previous config saved to /var/cache/conftool/dbconfig/20220329-140320-ladsgroup.json [14:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:30] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:13:03] (03CR) 10JMeybohm: profile::calico::kubernetes: add optional istio-cni config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [14:13:12] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:14:44] RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:59] (03PS13) 10MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) [14:16:20] (03CR) 10Elukey: [V: 03+1] profile::calico::kubernetes: add optional istio-cni config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [14:17:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23553 and previous config saved to /var/cache/conftool/dbconfig/20220329-141729-ladsgroup.json [14:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:53] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:18:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P23554 and previous config saved to /var/cache/conftool/dbconfig/20220329-141825-ladsgroup.json [14:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:26] (03CR) 10Giuseppe Lavagetto: [C: 03+1] httpd: Globally enable wmfjson [puppet] - 10https://gerrit.wikimedia.org/r/572702 (owner: 10Alexandros Kosiaris) [14:21:18] (03PS1) 10Giuseppe Lavagetto: fetch_external_cloud_vendors: update for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/774895 [14:24:15] (03PS2) 10Giuseppe Lavagetto: fetch_external_cloud_vendors: update for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/774895 [14:25:55] (03PS7) 10Elukey: 
profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [14:25:57] (03PS11) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [14:25:59] (03PS1) 10Elukey: k8s::kubeconfig: add ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/774896 [14:27:13] (03CR) 10jerkins-bot: [V: 04-1] k8s::kubeconfig: add ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/774896 (owner: 10Elukey) [14:27:14] RECOVERY - Check whether ferm is active by checking the default input chain on labstore1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:28:27] (03PS8) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [14:28:29] (03PS12) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [14:29:34] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:31:48] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:32:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23555 and previous config saved to /var/cache/conftool/dbconfig/20220329-143234-ladsgroup.json [14:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P23556 and previous config saved to /var/cache/conftool/dbconfig/20220329-143330-ladsgroup.json [14:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:35] (03PS2) 10Elukey: k8s::kubeconfig: add ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/774896 [14:40:37] (03PS9) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [14:40:39] (03PS13) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [14:41:11] (03CR) 10jerkins-bot: [V: 04-1] k8s::kubeconfig: add ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/774896 (owner: 10Elukey) [14:41:15] (03PS3) 10Giuseppe Lavagetto: fetch_external_cloud_vendors: update for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/774895 [14:47:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23557 and previous config saved to /var/cache/conftool/dbconfig/20220329-144739-ladsgroup.json [14:47:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P23558 and previous config saved to 
/var/cache/conftool/dbconfig/20220329-144740-marostegui.json [14:47:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:47:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:47:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23559 and previous config saved to /var/cache/conftool/dbconfig/20220329-144747-ladsgroup.json [14:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:57] (03PS3) 10Elukey: k8s::kubeconfig: add ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/774896 [14:47:59] (03PS10) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [14:48:01] (03PS14) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [14:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23560 and previous config saved to 
/var/cache/conftool/dbconfig/20220329-144835-ladsgroup.json [14:48:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:48:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:48:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23561 and previous config saved to /var/cache/conftool/dbconfig/20220329-144848-ladsgroup.json [14:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23562 and previous config saved to /var/cache/conftool/dbconfig/20220329-144854-ladsgroup.json [14:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:15] (03CR) 10jerkins-bot: [V: 04-1] profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 
10Elukey) [14:49:41] (03CR) 10Ahmon Dancy: scap: make rsync use new compress algorithm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774824 (https://phabricator.wikimedia.org/T252540) (owner: 10Hashar) [14:50:45] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:51:08] (03CR) 10Volans: "some comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/774895 (owner: 10Giuseppe Lavagetto) [14:51:21] (03CR) 10Volans: fetch_external_cloud_vendors: update for requestctl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774895 (owner: 10Giuseppe Lavagetto) [14:52:28] (03CR) 10Ahmon Dancy: "Is there a plan for monitoring the effect?" [puppet] - 10https://gerrit.wikimedia.org/r/774824 (https://phabricator.wikimedia.org/T252540) (owner: 10Hashar) [14:52:30] (03PS11) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [14:52:32] (03PS15) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [14:52:34] (03CR) 10Ahmon Dancy: [C: 03+1] scap: make rsync use new compress algorithm [puppet] - 10https://gerrit.wikimedia.org/r/774824 (https://phabricator.wikimedia.org/T252540) (owner: 10Hashar) [14:53:20] (03CR) 10jerkins-bot: [V: 04-1] profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [14:53:27] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34614/console" [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) 
[14:56:05] (03PS12) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [14:56:07] (03PS16) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [14:56:53] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add 'prometheus' instance to JobUnavailable [alerts] - 10https://gerrit.wikimedia.org/r/774861 (https://phabricator.wikimedia.org/T304922) (owner: 10Filippo Giunchedi) [14:57:01] (03PS2) 10Filippo Giunchedi: sre: add 'prometheus' instance to JobUnavailable [alerts] - 10https://gerrit.wikimedia.org/r/774861 (https://phabricator.wikimedia.org/T304922) [14:57:53] (03CR) 10Filippo Giunchedi: [C: 03+1] puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:58:12] (03CR) 10Giuseppe Lavagetto: fetch_external_cloud_vendors: update for requestctl (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/774895 (owner: 10Giuseppe Lavagetto) [14:58:40] (03CR) 10MVernon: [C: 03+2] puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [15:00:07] 10SRE, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3), and 2 others: Unquoted URL parameter - https://phabricator.wikimedia.org/T304323 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I believe with the latest patches merged all `check_http` urls are quoted now, tentatively... 
[15:02:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T297189)', diff saved to https://phabricator.wikimedia.org/P23563 and previous config saved to /var/cache/conftool/dbconfig/20220329-150245-marostegui.json [15:02:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:02:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:51] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [15:02:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T297189)', diff saved to https://phabricator.wikimedia.org/P23564 and previous config saved to /var/cache/conftool/dbconfig/20220329-150253-marostegui.json [15:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:05] (03PS4) 10Giuseppe Lavagetto: fetch_external_cloud_vendors: update for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/774895 [15:04:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23565 and previous config saved to /var/cache/conftool/dbconfig/20220329-150359-ladsgroup.json [15:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:09] (03CR) 10AOkoth: [C: 03+2] aptrepo::files::updates Update gitlab-ce and gitlab-runner to 14.9 [puppet] - 10https://gerrit.wikimedia.org/r/774884 (https://phabricator.wikimedia.org/T304622) (owner: 10Jelto) [15:08:42] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/774895 (owner: 10Giuseppe Lavagetto) [15:09:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23566 and previous config saved to /var/cache/conftool/dbconfig/20220329-150900-ladsgroup.json [15:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:09:39] (03CR) 10Muehlenhoff: [C: 03+2] certspotter: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/773780 (owner: 10Muehlenhoff) [15:10:45] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:15:48] (03PS1) 10Tchanders: Remove wgWMEIPAddressCopyActionEnabled from Beta and production config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774904 (https://phabricator.wikimedia.org/T296469) [15:19:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23567 and previous config saved to /var/cache/conftool/dbconfig/20220329-151905-ladsgroup.json [15:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:34] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [15:19:34] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [15:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:32] !log 
bking@cumin1001 START - Cookbook sre.wdqs.reboot [15:20:33] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [15:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:07] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [15:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:20] (03CR) 10JMeybohm: [C: 04-1] Add helm charts and a helmfile configuration for datahub (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [15:24:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P23568 and previous config saved to /var/cache/conftool/dbconfig/20220329-152405-ladsgroup.json [15:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:37] (03PS1) 10AOkoth: aptrepo: update gitlab-ce & gitlab-runner to 14.9 [puppet] - 10https://gerrit.wikimedia.org/r/774905 (https://phabricator.wikimedia.org/T304622) [15:26:51] (03CR) 10JMeybohm: [C: 03+1] k8s::kubeconfig: add ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/774896 (owner: 10Elukey) [15:27:15] (03CR) 10Jelto: [C: 03+1] aptrepo: update gitlab-ce & gitlab-runner to 14.9 [puppet] - 10https://gerrit.wikimedia.org/r/774905 (https://phabricator.wikimedia.org/T304622) (owner: 10AOkoth) [15:27:26] (03CR) 10AOkoth: [C: 03+2] aptrepo: update gitlab-ce & gitlab-runner to 14.9 [puppet] - 10https://gerrit.wikimedia.org/r/774905 (https://phabricator.wikimedia.org/T304622) (owner: 10AOkoth) [15:34:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] fetch_external_cloud_vendors: update for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/774895 (owner: 10Giuseppe Lavagetto) [15:34:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23569 and previous config saved to /var/cache/conftool/dbconfig/20220329-153410-ladsgroup.json [15:34:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:34:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:34:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:34:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23570 and previous config saved to /var/cache/conftool/dbconfig/20220329-153423-ladsgroup.json [15:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:31] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[15:36:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23571 and previous config saved to /var/cache/conftool/dbconfig/20220329-153630-ladsgroup.json [15:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P23572 and previous config saved to /var/cache/conftool/dbconfig/20220329-153910-ladsgroup.json [15:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:03] (03CR) 10Razzi: [V: 03+1 C: 03+2] kafka: allow access to jumbo from karapace1001 [puppet] - 10https://gerrit.wikimedia.org/r/774538 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [15:41:11] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:13] (03CR) 10Ottomata: [C: 03+1] kafka: allow access to jumbo from karapace1001 [puppet] - 10https://gerrit.wikimedia.org/r/774538 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [15:43:00] !log imported scap 4.5.0 to strets-/buster-/bullseye-wikimedia - T304134 [15:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:05] T304134: Deploy Scap version 4.5.0 - https://phabricator.wikimedia.org/T304134 [15:46:20] !log updated scap to 4.5.0 on canary hosts - T304134 [15:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:13] !log jayme@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) [15:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:31] !log jayme@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no 
justification provided) (duration: 00m 18s) [15:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:53] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10LWyatt) >>! In T297968#7807725, @awight wrote: > Awkwardly, I went to bbcrewind.co.uk to get an idea of whether they're running MediaWiki and generally how they plan to host Kartotherian-backed maps, but... [15:49:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T297189)', diff saved to https://phabricator.wikimedia.org/P23573 and previous config saved to /var/cache/conftool/dbconfig/20220329-154941-marostegui.json [15:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:48] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [15:51:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23574 and previous config saved to /var/cache/conftool/dbconfig/20220329-155135-ladsgroup.json [15:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:26] (03PS2) 10Razzi: Add superset-next domain CNAME [dns] - 10https://gerrit.wikimedia.org/r/774537 (https://phabricator.wikimedia.org/T275575) [15:54:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23575 and previous config saved to /var/cache/conftool/dbconfig/20220329-155415-ladsgroup.json [15:54:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [15:54:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [15:54:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
12:00:00 on 10 hosts with reason: Maintenance [15:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [15:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220329T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:03:08] (03PS3) 10RhinosF1: sallogger: send to #wikimedia-cloud-feed instead [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773309 [16:04:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P23576 and previous config saved to /var/cache/conftool/dbconfig/20220329-160446-marostegui.json [16:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:29] (03PS7) 10JMeybohm: Remove LVS for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) [16:06:31] (03PS1) 10JMeybohm: Move miscweb back to state monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/774916 (https://phabricator.wikimedia.org/T290966) [16:06:33] (03PS1) 10JMeybohm: Move miscweb back to state production [puppet] - 10https://gerrit.wikimedia.org/r/774917 (https://phabricator.wikimedia.org/T290966) [16:06:39] (03PS44) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [16:06:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23577 and previous config saved to /var/cache/conftool/dbconfig/20220329-160640-ladsgroup.json [16:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:51] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (039 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [16:07:37] (03CR) 10JMeybohm: Remove LVS for miscweb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [16:15:41] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on 
alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:17:30] !log aokoth@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM gitlab2001.wikimedia.org [16:17:32] !log aokoth@cumin1001 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM gitlab2001.wikimedia.org [16:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:31] !log aokoth@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM gitlab2001.wikimedia.org [16:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P23578 and previous config saved to /var/cache/conftool/dbconfig/20220329-161950-marostegui.json [16:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:29] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [16:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23579 and previous config saved to /var/cache/conftool/dbconfig/20220329-162146-ladsgroup.json [16:21:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:21:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, 
user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:21:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23580 and previous config saved to /var/cache/conftool/dbconfig/20220329-162153-ladsgroup.json [16:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23581 and previous config saved to /var/cache/conftool/dbconfig/20220329-162401-ladsgroup.json [16:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:21] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:25:41] (03CR) 10Hashar: [C: 03+2] "Lets give that a try :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/774769 (owner: 10PipelineBot) [16:26:13] (03CR) 10Andrew Bogott: [C: 03+1] "LGTM, are we ready to roll this out and test?" 
[puppet] - 10https://gerrit.wikimedia.org/r/771406 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [16:27:46] /srv/deployment-charts [16:28:01] error: cannot open .git/FETCH_HEAD: Permission denied [16:28:04] root:root [16:28:28] https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments says that a cron job updates the tree [16:28:40] OH THAT WAS IT [16:28:53] CD == Cron Delivery [16:28:59] haha [16:29:27] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/774769 (owner: 10PipelineBot) [16:30:07] hashar: better than STD (systemd-timer delivery!) [16:30:13] lol [16:30:17] AHAH [16:30:21] (JobUnavailable) firing: (7) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:30:29] so yeah something did a pull [16:30:37] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:31:11] (03CR) 10Dzahn: [C: 03+1] gitlab: move systemd interval for backup and restore to hiera [puppet] - 10https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [16:31:21] (03PS1) 10Andrew Bogott: WMCS: replace a few stray URLS that weren't using the openstack server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/774922 (https://phabricator.wikimedia.org/T256144) [16:31:30] ./deploy.sh: line 62: pushd: /srv/deployment-charts/helmfile.d/services/staging/blubberoid: No such file or directory [16:31:38] progress :) [16:31:59] (03CR) 10JMeybohm: Add helm charts and a helmfile configuration for datahub (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 
(https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [16:32:10] `/srv/deployment-charts/helmfile.d/services/blubberoid` is the dir you want [16:33:07] (03CR) 10Elukey: [C: 03+2] k8s::kubeconfig: add ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/774896 (owner: 10Elukey) [16:34:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T297189)', diff saved to https://phabricator.wikimedia.org/P23582 and previous config saved to /var/cache/conftool/dbconfig/20220329-163455-marostegui.json [16:34:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [16:34:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [16:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:02] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [16:35:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T297189)', diff saved to https://phabricator.wikimedia.org/P23583 and previous config saved to /var/cache/conftool/dbconfig/20220329-163503-marostegui.json [16:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:18] the script tries to cd to /srv/deployment-charts/helmfile.d/services/${env}/${SERVICE_NAME} [16:35:45] What script? I didn't see where deploy.sh comes from [16:36:09] I was trying the deploy.sh script in /srv/deployment-charts [16:36:21] but it is making different assumptions apparently [16:36:38] https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments step 7 is the one-liner you need. 
[16:36:49] OH [16:37:50] !log hashar@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [16:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:59] (03PS13) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [16:38:01] (03PS17) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [16:38:16] !log hashar@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [16:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:30] !log hashar@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [16:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:38:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:52] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34615/console" [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [16:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:00] !log hashar@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [16:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23584 and previous config saved to 
/var/cache/conftool/dbconfig/20220329-163906-ladsgroup.json [16:39:07] !log hashar@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [16:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:35] !log hashar@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [16:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:40] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (10Dzahn) :) nice, thanks for confirming [16:39:47] dancy: looks like I did my first helm based deployment [16:39:51] PROBLEM - Host gitlab2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:39:55] Nice work! [16:41:41] I'm taking a break to celebrate [16:41:47] (03PS1) 10Volans: pylint: fix newly reported issue [software/spicerack] - 10https://gerrit.wikimedia.org/r/774925 [16:41:49] (03PS1) 10Volans: ipmi: add remove_boot_override, improve force_pxe [software/spicerack] - 10https://gerrit.wikimedia.org/r/774926 (https://phabricator.wikimedia.org/T304434) [16:42:09] success [16:42:15] dancy: thank you for your assistance. Solved! [16:42:38] (03PS1) 10Volans: sre.hosts.reimage: call Ipmi.remove_boot_override [cookbooks] - 10https://gerrit.wikimedia.org/r/774927 (https://phabricator.wikimedia.org/T304434) [16:44:59] gitlab2001 down is known maintenance [16:45:13] arnoldokoth: is it rebooting normal? [16:45:46] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2001.codfw.wmnet [16:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:12] good. 
I am out now :] [16:51:50] (03CR) 10Volans: [C: 03+2] "trivial, self-merging to unblock other patches from failing CI" [software/spicerack] - 10https://gerrit.wikimedia.org/r/774925 (owner: 10Volans) [16:51:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2001.codfw.wmnet [16:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23586 and previous config saved to /var/cache/conftool/dbconfig/20220329-165411-ladsgroup.json [16:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:34] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2002.codfw.wmnet [16:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:33] (03PS1) 10Jdlrobson: End migration mode [skins/Vector] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/774839 (https://phabricator.wikimedia.org/T301930) [16:59:38] (03Merged) 10jenkins-bot: pylint: fix newly reported issue [software/spicerack] - 10https://gerrit.wikimedia.org/r/774925 (owner: 10Volans) [16:59:46] (03PS1) 10Jdlrobson: Restore the classes skin-vector and skin-vector-search-vue to body [skins/Vector] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/774840 [17:00:13] !log aokoth@cumin1001 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM gitlab2001.wikimedia.org [17:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:54] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [17:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2002.codfw.wmnet [17:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:03:48] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2003.codfw.wmnet [17:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:29] mutante: Nope, reboot is not going well it seems. [17:04:32] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet [17:04:32] !log klausman@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ml-staging2001.codfw.wmnet [17:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:06] (03PS2) 10Jdlrobson: End migration mode [skins/Vector] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/774839 (https://phabricator.wikimedia.org/T301930) [17:08:58] Is there someone I can poke about getting a word added to fancycaptcha/badwords on mwmaint1002? (https://wikitech.wikimedia.org/wiki/Generating_CAPTCHAs) [17:09:15] This is DEFINITELY not so I can be lazy and not file a ticket [17:09:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23588 and previous config saved to /var/cache/conftool/dbconfig/20220329-170916-ladsgroup.json [17:09:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:09:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T298565 [17:09:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23589 and previous config saved to /var/cache/conftool/dbconfig/20220329-170924-ladsgroup.json [17:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:56] perryprog: umh what do you need? [17:11:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2003.codfw.wmnet [17:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:40] Someone on VRT got a captcha with "aryan" in it which while not really explicit, also doesn't really need to be in a captcha [17:11:48] Someone who emailed VRT* [17:12:35] mind if i move this to pms? [17:12:37] sure [17:13:38] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-tool1005.eqiad.wmnet with reason: Testing deploy of superset 1.4.2 to staging [17:13:39] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-tool1005.eqiad.wmnet with reason: Testing deploy of superset 1.4.2 to staging [17:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:24] I said earlier gitlab2001 is expected maintenance.. well.. the reboot was [17:17:30] but not the part that it did not come back from it [17:18:08] and the cookbook said it failed to get status.. though I see it in console.. 
so it's probably another one of those cases where numbering of NICs changed [17:22:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [17:22:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [17:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:06] !log gitlab2001 - did not come back from reboot via cookbook. logged in via console. then "s/ens5/ens13" in /etc/network/interfaces ; reboot ; issue was like T272555 and others [17:23:07] RECOVERY - Host gitlab2001 is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms [17:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:11] T272555: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 [17:23:35] (03CR) 10Subramanya Sastry: [C: 03+1] Add wikimedia.com to wgNoFollowDomainExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773302 (https://phabricator.wikimedia.org/T304555) (owner: 10Arlolra) [17:25:26] (JobUnavailable) firing: (7) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:26:00] (03CR) 10RLazarus: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/774821 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [17:28:41] PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:27] 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering, 10serviceops-radar, 10Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10Krinkle) It seems the redirect isn't working for the `r` URLs: !log gitlab2001 - systemctl reset-failed [17:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:51] RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:50] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Jclark-ctr) [17:34:05] PROBLEM - Host wdqs2005 is DOWN: PING CRITICAL - Packet loss = 100% [17:34:15] RECOVERY - Host wdqs2005 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [17:35:33] perryprog: done -- change will be live within 30 minutes [17:35:40] and thanks for the report! 
[17:35:48] Thanks for the help o7 [17:36:50] hmmm, I might have lied -- the change to the *word list* will be live, but I'm not sure if we need to do anything extra to regenerate images [17:37:11] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Hue [puppet] - 10https://gerrit.wikimedia.org/r/773232 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:37:48] rzl I think it happens on cron as of T150029 [17:37:48] T150029: Create cronjob for regular captcha regeneration - https://phabricator.wikimedia.org/T150029 [17:37:54] IIRC we generate a batch at a time, and just make new ones when we run out, so-- or that too :) [17:38:56] i'd imagine it's Good Enough to just let the system run out of the current images, given the size of the word list I don't think it's going to be a very common word in them [17:39:02] let's see, that job runs every week -- I can restart it but I'm inclined to just leave it [17:39:03] yeah [17:39:28] you are right, that is a maintenance "cron" (timer) [17:39:34] if needed we could run it manually [17:39:34] I assume the cronjob wipes the whole set every week since delete-on-solve isn't on [17:39:46] 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering, 10serviceops-radar, 10Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10Krinkle) >>! In T205361#7624441, @gerritbot wrote: > Change 754088 **merged... [17:40:02] oh, so others are going to get the same image until we regenerate them? I didn't realize that, it changes the landscape a bit [17:40:11] I thiiiink so [17:40:35] (this may or may not be hard to believe, but I haven't spent much time in this system before) [17:41:14] It'd be nice to have $wgCaptchaDeleteOnSolve enabled but getting captcha generation "lined up" with the rate that captchas are being solved sounds a bit Scary. 
(T150049) [17:41:15] T150049: Enable $wgCaptchaDeleteOnSolve - https://phabricator.wikimedia.org/T150049 [17:41:48] Plus pip is slow so I feel like doing one-offs for each solve isn't good [17:41:55] er, pillow, not pip [17:42:23] https://phabricator.wikimedia.org/T150029#2772632 [17:42:35] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ConfirmEdit/+/322735/ [17:42:39] ^ here the --delete option was added [17:43:07] that was all in 2016 but I assume it's still all valid what was said there [17:43:07] (03PS1) 10Muehlenhoff: imagecatalog: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/774939 [17:43:40] sorry, I have to go afk but running the maintenance cron should be safe as long as it's the same command the timer runs [17:43:53] okay cool, I was coming around to the same conclusion [17:44:24] (03PS1) 10Reedy: captchaloop: Replace deprecated blacklist parameter [puppet] - 10https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) [17:44:30] I'm just hesitant because if it *does* turn out to leave us in an unexpected state, I'd rather have someone around who knows how to get back out of it :) [17:45:15] (03CR) 10Reedy: [C: 04-1] ""not yet"... 
We need https://gerrit.wikimedia.org/r/680999 in WMF production on all supported MW branches" [puppet] - 10https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) (owner: 10Reedy) [17:45:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T297189)', diff saved to https://phabricator.wikimedia.org/P23590 and previous config saved to /var/cache/conftool/dbconfig/20220329-174518-marostegui.json [17:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:26] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [17:45:55] R-eedy might but they might be busy :P [17:45:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/774939 (owner: 10Muehlenhoff) [17:47:31] !log installing tiff security updates [17:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:02] (03PS1) 10Majavah: mediawiki: fix r123 syntax for special:codereview redirects [puppet] - 10https://gerrit.wikimedia.org/r/774943 (https://phabricator.wikimedia.org/T205361) [17:53:36] 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering, 10serviceops-radar, 10Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10Majavah) >>! In T205361#7815503, @Krinkle wrote: > It seems the result of t... [18:00:05] hashar and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220329T1800). 
[18:00:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P23591 and previous config saved to /var/cache/conftool/dbconfig/20220329-180023-marostegui.json [18:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:56] !log restarting fpm on mw canaries to pick up new libtiff [18:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:05:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:52] 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering, 10serviceops-radar, 10Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10Krinkle) >>! In T205361#7815521, @Majavah wrote: >>>! In T205361#7815503, @... 
[18:09:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23592 and previous config saved to /var/cache/conftool/dbconfig/20220329-180938-ladsgroup.json [18:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:44] (03CR) 10Krinkle: mediawiki: fix r123 syntax for special:codereview redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774943 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [18:09:45] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:10:30] (03CR) 10Majavah: mediawiki: fix r123 syntax for special:codereview redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774943 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [18:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:15:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P23593 and previous config saved to /var/cache/conftool/dbconfig/20220329-181529-marostegui.json [18:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23594 and previous config saved to /var/cache/conftool/dbconfig/20220329-182444-ladsgroup.json [18:24:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T297189)', diff saved to https://phabricator.wikimedia.org/P23595 and previous config saved to /var/cache/conftool/dbconfig/20220329-183034-marostegui.json [18:30:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance [18:30:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance [18:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:39] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [18:30:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T297189)', diff saved to https://phabricator.wikimedia.org/P23596 and previous config saved to /var/cache/conftool/dbconfig/20220329-183041-marostegui.json [18:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300775)', diff saved to https://phabricator.wikimedia.org/P23597 and previous config saved to /var/cache/conftool/dbconfig/20220329-183215-marostegui.json [18:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:21] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [18:33:08] (03PS1) 10Jsn.sherman: Add surveys to enwiki on beta for QA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774948 (https://phabricator.wikimedia.org/T294363) [18:34:18] (03CR) 10Jsn.sherman: "Hey Essex, I'd really appreciate your eyes on this 
since you've done so many of these!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774948 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [18:35:31] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:39:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23598 and previous config saved to /var/cache/conftool/dbconfig/20220329-183949-ladsgroup.json [18:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:03] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:33] (03CR) 10Herron: [C: 03+1] logging: bump alerts logs retention (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774364 (https://phabricator.wikimedia.org/T304924) (owner: 10Filippo Giunchedi) [18:45:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:45:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P23599 and previous config saved to /var/cache/conftool/dbconfig/20220329-184720-marostegui.json [18:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:42] (03PS2) 10Jsn.sherman: Add surveys to enwiki on beta for QA [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/774948 (https://phabricator.wikimedia.org/T294363) [18:54:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23600 and previous config saved to /var/cache/conftool/dbconfig/20220329-185454-ladsgroup.json [18:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:55:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [18:55:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [18:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23601 and previous config saved to /var/cache/conftool/dbconfig/20220329-185526-ladsgroup.json [18:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:58] (03CR) 10Bearloga: [C: 03+1] Config for new android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773896 (owner: 10Sharvaniharan) [18:57:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23602 and previous config saved to /var/cache/conftool/dbconfig/20220329-185733-ladsgroup.json [18:57:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1001.wikimedia.org [18:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:17] (03PS2) 10Andrew Bogott: WMCS: replace a few stray URLS that weren't using the openstack server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/774922 (https://phabricator.wikimedia.org/T256144) [18:59:19] (03PS1) 10Andrew Bogott: wmcs-novastats-proxyleaks.py: Improve handling of proxies under wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/774951 [18:59:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1001.wikimedia.org [19:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2001.wikimedia.org [19:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:13] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-proxyleaks.py: Improve handling of proxies under wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/774951 (owner: 10Andrew Bogott) [19:01:21] (03CR) 10Andrew Bogott: [C: 03+2] WMCS: replace a few stray URLS that weren't using the openstack server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/774922 (https://phabricator.wikimedia.org/T256144) (owner: 10Andrew Bogott) [19:02:06] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 95 probes of 673 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:02:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P23603 and previous config saved to 
/var/cache/conftool/dbconfig/20220329-190226-marostegui.json [19:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:36] (03PS3) 10Andrew Bogott: dynamicproxy: cleanup remaining x-novaproxy-edit-dns users [puppet] - 10https://gerrit.wikimedia.org/r/771406 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [19:04:17] (03PS1) 10Eigyan: [config]: Deploy gdi-safety-survey to ES,EN,FR and PT wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774953 [19:04:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2001.wikimedia.org [19:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:13] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 65 probes of 756 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:10:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. 
I'm currently looking into a backport of python-cachelib for bullseye-wikimedia and when that's done I'll merge your pat" [puppet] - 10https://gerrit.wikimedia.org/r/774512 (https://phabricator.wikimedia.org/T301638) (owner: 10Dave Pifke) [19:10:45] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:12:02] (03PS2) 10Herron: admin: add tsepothoabala to deployment [puppet] - 10https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [19:12:23] (03CR) 10jerkins-bot: [V: 04-1] admin: add tsepothoabala to deployment [puppet] - 10https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [19:12:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23604 and previous config saved to /var/cache/conftool/dbconfig/20220329-191238-ladsgroup.json [19:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:00] !log mforns@deploy1002 Started deploy [analytics/refinery@8e9f97c]: Regular analytics weekly train [analytics/refinery@8e9f97c] [19:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:13] PROBLEM - Host wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:15:15] RECOVERY - Host wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:15:27] (03CR) 10Eigyan: [C: 03+1] "DIFF looks good; LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774948 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [19:15:55] (03CR) 10RLazarus: "Cool! PCC claims it will get rid of the group, though -- is that expected?" 
[puppet] - 10https://gerrit.wikimedia.org/r/774939 (owner: 10Muehlenhoff) [19:17:20] (03PS6) 10Juan90264: Fix I7ce58529cdd320a9500dc215291ef1c369cee9d3: Rearranging restriction levels and add editautopatrolprotected for eliminators. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773320 (https://phabricator.wikimedia.org/T303579) (owner: 10NguoiDungKhongDinhDanh) [19:17:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300775)', diff saved to https://phabricator.wikimedia.org/P23605 and previous config saved to /var/cache/conftool/dbconfig/20220329-191731-marostegui.json [19:17:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1119.eqiad.wmnet with reason: Maintenance [19:17:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1119.eqiad.wmnet with reason: Maintenance [19:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:37] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [19:17:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T300775)', diff saved to https://phabricator.wikimedia.org/P23606 and previous config saved to /var/cache/conftool/dbconfig/20220329-191738-marostegui.json [19:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:04] (03CR) 10Juan90264: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773320 (https://phabricator.wikimedia.org/T303579) (owner: 10NguoiDungKhongDinhDanh) [19:20:31] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 65 probes of 673 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:24:29] (03CR) 10Muehlenhoff: imagecatalog: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774939 (owner: 10Muehlenhoff) [19:24:39] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 11 probes of 756 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:26:19] (03CR) 10RLazarus: [C: 03+1] imagecatalog: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774939 (owner: 10Muehlenhoff) [19:27:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23607 and previous config saved to /var/cache/conftool/dbconfig/20220329-192743-ladsgroup.json [19:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [19:28:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [19:28:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [19:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [19:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:56] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1131 (T297189)', diff saved to https://phabricator.wikimedia.org/P23608 and previous config saved to /var/cache/conftool/dbconfig/20220329-193055-marostegui.json [19:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:01] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [19:35:13] !log mforns@deploy1002 Finished deploy [analytics/refinery@8e9f97c]: Regular analytics weekly train [analytics/refinery@8e9f97c] (duration: 21m 13s) [19:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:29] !log mforns@deploy1002 Started deploy [analytics/refinery@8e9f97c] (thin): Regular analytics weekly train THIN [analytics/refinery@8e9f97c] [19:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:37] !log mforns@deploy1002 Finished deploy [analytics/refinery@8e9f97c] (thin): Regular analytics weekly train THIN [analytics/refinery@8e9f97c] (duration: 00m 08s) [19:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:46] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:47] !log mforns@deploy1002 Started deploy [analytics/refinery@8e9f97c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8e9f97c] [19:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:16] (03CR) 10Scardenasmolinar: [C: 03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/774953 (owner: 10Eigyan) [19:40:26] !log uploaded cachelib 0.4.1-2~wmf1 to bullseye-wikimedia T301638 [19:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:31] (03PS4) 10Sharvaniharan: Config for new android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773896 [19:40:32] T301638: Upgrade deployment-webperf hosts to Debian Buster or Bullseye - https://phabricator.wikimedia.org/T301638 [19:41:10] mortizm: Thanks! [19:41:45] (ha, apparently I can't type this afternoon) [19:42:13] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23609 and previous config saved to /var/cache/conftool/dbconfig/20220329-194248-ladsgroup.json [19:42:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [19:42:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [19:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:42:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23610 and previous config saved to /var/cache/conftool/dbconfig/20220329-194256-ladsgroup.json [19:42:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:04] !log mforns@deploy1002 Finished deploy [analytics/refinery@8e9f97c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8e9f97c] (duration: 07m 17s) [19:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:23] (03PS1) 10Reedy: Use namespaced GerritExtDistProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774963 [19:44:08] (03CR) 10Reedy: [C: 04-2] "Not yet... .5 needs to be stable 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774963 (owner: 10Reedy) [19:46:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P23611 and previous config saved to /var/cache/conftool/dbconfig/20220329-194601-marostegui.json [19:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:49] PROBLEM - Host wdqs1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:48:21] RECOVERY - Host wdqs1008 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [19:50:27] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [19:55:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23612 and previous config saved to /var/cache/conftool/dbconfig/20220329-195505-ladsgroup.json [19:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email 
on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:55:56] (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me too!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774953 (owner: 10Eigyan) [20:00:04] RoanKattouw and Urbanecm: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220329T2000). Please do the needful. [20:00:05] Jdlrobson, arlolra, sharvani_, eigyan, and Juan_90264: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] present [20:00:37] present [20:00:38] greetings ALL! [20:01:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P23613 and previous config saved to /var/cache/conftool/dbconfig/20220329-200106-marostegui.json [20:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:27] here [20:04:29] 10SRE, 10Infrastructure-Foundations: Many Ganeti hosts have disk space warnings on /boot - https://phabricator.wikimedia.org/T304897 (10MoritzMuehlenhoff) 05Open→03Resolved Unused kernels were pruned. [20:05:17] Hello, I'm present [20:06:11] Backport late? [20:08:28] I am Juan_90264 [20:09:32] hi Juan... may be they are a bit late... not sure. 
[20:10:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23614 and previous config saved to /var/cache/conftool/dbconfig/20220329-201011-ladsgroup.json [20:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [20:10:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [20:10:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23615 and previous config saved to /var/cache/conftool/dbconfig/20220329-201041-ladsgroup.json [20:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:15:05] Hello? [20:15:49] There's no backporter available. 
I'm asking around [20:16:04] Roan is coming in a bit.. [20:16:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T297189)', diff saved to https://phabricator.wikimedia.org/P23616 and previous config saved to /var/cache/conftool/dbconfig/20220329-201611-marostegui.json [20:16:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [20:16:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [20:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:17] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [20:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:26] RoanKattouw says he will be back shortly. [20:20:43] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [20:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:14] On my way, just plugging in and turning on my dead laptop [20:23:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23617 and previous config saved to /var/cache/conftool/dbconfig/20220329-202333-ladsgroup.json [20:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:39] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:24:10] So sorry for the delay and the miscommunication everyone. 
I'm here now, let's get started [20:24:27] (03CR) 10Catrope: [C: 03+2] Restore the classes skin-vector and skin-vector-search-vue to body [skins/Vector] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/774840 (owner: 10Jdlrobson) [20:25:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23618 and previous config saved to /var/cache/conftool/dbconfig/20220329-202516-ladsgroup.json [20:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:41] no worries RoanKattouw thank you! [20:25:47] (03CR) 10Catrope: [C: 03+2] End migration mode [skins/Vector] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/774839 (https://phabricator.wikimedia.org/T301930) (owner: 10Jdlrobson) [20:25:59] I've reached out to someone else at WMF to provide backup for the deployers after a suggestion by Roan FYI @thcipriani [20:26:17] (03PS3) 10Herron: admin: add tsepothoabala to deployment [puppet] - 10https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [20:26:33] (03CR) 10Catrope: [C: 03+2] Config for new android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773896 (owner: 10Sharvaniharan) [20:27:06] Yeah, we only have two deployers signed up for this window and the other one lives in Europe. I try to make it to this window but it is during lunchtime for me, so that doesn't work every day [20:27:11] (03CR) 10jerkins-bot: [V: 04-1] admin: add tsepothoabala to deployment [puppet] - 10https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [20:27:15] Sharing the load across more people would be great [20:27:21] Alright, let's start with sharvani_'s patch [20:27:28] (03Merged) 10jenkins-bot: Config for new android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773896 (owner: 10Sharvaniharan) [20:27:52] thank you Roan! 
appreciate you making the time .. [20:28:07] sharvani_: Your patch is on mwdebug1002 for testing, but it looks like it might be the kind of patch that can't be tested there easily? [20:28:20] If so, we can skip the testing phase and I can deploy it, up to you [20:28:39] (03PS2) 10Catrope: [config]: Deploy gdi-safety-survey to ES,EN,FR and PT wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774953 (owner: 10Eigyan) [20:28:51] tested and it shows up! thank you :) [20:28:59] (03CR) 10Catrope: [C: 03+2] [config]: Deploy gdi-safety-survey to ES,EN,FR and PT wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774953 (owner: 10Eigyan) [20:29:19] Yay! Deploying [20:29:27] eigyan: Your patch is next up [20:29:46] (03Merged) 10jenkins-bot: [config]: Deploy gdi-safety-survey to ES,EN,FR and PT wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774953 (owner: 10Eigyan) [20:29:49] Awesome thank you RoanKattouw [20:29:50] (03PS4) 10Herron: admin: add tsepothoabala to deployment [puppet] - 10https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [20:30:15] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:773896|Config for new android schemas]] (duration: 01m 00s) [20:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:32] eigyan: Ready for testing now on mwdebug1002 [20:30:49] Testing now RoanKattouw [20:30:49] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T304849 (10phaultfinder) [20:31:33] (03PS3) 10Catrope: Add wikimedia.com to wgNoFollowDomainExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773302 (https://phabricator.wikimedia.org/T304555) (owner: 10Arlolra) [20:31:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:52] (03CR) 10Catrope: 
[C: 03+2] Add wikimedia.com to wgNoFollowDomainExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773302 (https://phabricator.wikimedia.org/T304555) (owner: 10Arlolra) [20:32:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:32:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:43] (03Merged) 10jenkins-bot: Add wikimedia.com to wgNoFollowDomainExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773302 (https://phabricator.wikimedia.org/T304555) (owner: 10Arlolra) [20:33:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:49] I have validated my patch on mwdebug1002 thanks RoanKattouw [20:35:23] (03PS5) 10Juan90264: Add extendedconfirmed user group for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh) [20:35:27] Alright, deploying [20:36:16] (03CR) 10Juan90264: [C: 03+1] "Thanks and LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/774834 (https://phabricator.wikimedia.org/T302860) (owner: 10NguoiDungKhongDinhDanh) [20:36:20] ^ gr8 [20:36:35] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:774953|[config]: Deploy gdi-safety-survey to ES,EN,FR and PT wikis]] (duration: 00m 56s) [20:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:45] Okay [20:36:49] arlolra: Your patch is now ready for testing on mwdebug1002 [20:37:13] looks good, thanks [20:37:26] (03PS7) 10Catrope: Fix I7ce58529cdd320a9500dc215291ef1c369cee9d3: Rearranging restriction levels and add editautopatrolprotected for eliminators. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773320 (https://phabricator.wikimedia.org/T303579) (owner: 10NguoiDungKhongDinhDanh) [20:37:47] (03CR) 10Catrope: [C: 03+2] Fix I7ce58529cdd320a9500dc215291ef1c369cee9d3: Rearranging restriction levels and add editautopatrolprotected for eliminators. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773320 (https://phabricator.wikimedia.org/T303579) (owner: 10NguoiDungKhongDinhDanh) [20:38:09] Alright, deploying arlolra's patch. Then Juan_9026480's patches are netx [20:38:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P23619 and previous config saved to /var/cache/conftool/dbconfig/20220329-203838-ladsgroup.json [20:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:52] (03Merged) 10jenkins-bot: Fix I7ce58529cdd320a9500dc215291ef1c369cee9d3: Rearranging restriction levels and add editautopatrolprotected for eliminators. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/773320 (https://phabricator.wikimedia.org/T303579) (owner: 10NguoiDungKhongDinhDanh) [20:39:06] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:773302|Add wikimedia.com to wgNoFollowDomainExceptions (T304555)]] (duration: 01m 06s) [20:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:11] T304555: Add wikimedia.com to wgNoFollowDomainExceptions - https://phabricator.wikimedia.org/T304555 [20:39:43] Excelent merged! [20:39:52] thank you RoanKattouw [20:40:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23620 and previous config saved to /var/cache/conftool/dbconfig/20220329-204021-ladsgroup.json [20:40:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:40:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:40:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [20:40:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [20:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to 
https://phabricator.wikimedia.org/P23621 and previous config saved to /var/cache/conftool/dbconfig/20220329-204034-ladsgroup.json [20:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:52] Juan_9026480: Your first patch is ready for testing on mwdebug1002 [20:41:33] (03Merged) 10jenkins-bot: Restore the classes skin-vector and skin-vector-search-vue to body [skins/Vector] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/774840 (owner: 10Jdlrobson) [20:41:48] Okay, I will test [20:42:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:42:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23622 and previous config saved to /var/cache/conftool/dbconfig/20220329-204241-ladsgroup.json [20:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:18] Juan_9026480: As for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/774384 , that's a patch in MW core not in config, it can just go through the normal code review process and ride the train, right? [20:44:33] Hmm I guess you said it would be deployed on the task, so I'll just do it anyway. 
But it'll take a bit longer, because it'll have to go through CI twice (first the patch in master, then the backport) [20:44:55] (03Merged) 10jenkins-bot: End migration mode [skins/Vector] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/774839 (https://phabricator.wikimedia.org/T301930) (owner: 10Jdlrobson) [20:45:27] RoanKattouw: if you're doing it now, probably worth doing namespaceDupes if we have any wiki that language [20:45:33] Yes I will [20:45:49] :) [20:46:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:58] Jdlrobson: Your Vector changes are on mwdebug1002 for testing [20:47:23] RoanKattouw: looking [20:49:46] RoanKattouw: lgtm. Both can be synced [20:50:42] !log catrope@deploy1002 Scap failed!: 9/9 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back. [20:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:22] RoanKattouw: I tested and approved [20:51:30] Oh yikes, deploying the wmf.4 patch caused 500 errors [20:51:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:51:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:07] Serious?! 
[20:52:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:08] That's a lot of errors [20:53:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P23623 and previous config saved to /var/cache/conftool/dbconfig/20220329-205343-ladsgroup.json [20:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:02] Jdlrobson's patch resulted in errors on the canaries, rolling it back now [20:54:02] yeah, it doesn't look scap friendly at all [20:54:26] !log catrope@deploy1002 Synchronized php-1.39.0-wmf.4/skins/Vector: Backport: Revert: [[gerrit:774839|End migration mode]] (duration: 00m 53s) [20:54:29] Error: Class 'Vector\SkinVersionLookup' not found [20:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:39] Yeah it was also a huge patch [20:55:17] Not sure why it worked in mwdebug1002, maybe it was tested against a wmf.5 wiki instead of a wmf.4 wiki? [20:55:21] RoanKattouw: oh? [20:55:24] hmm [20:55:38] SkinVersionLookup shouldn't exist after that patch [20:55:44] (03PS1) 10Catrope: Revert "End migration mode" [skins/Vector] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/774842 [20:55:49] (03CR) 10Catrope: [C: 03+2] Revert "End migration mode" [skins/Vector] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/774842 (owner: 10Catrope) [20:55:57] oh wait.. maybe this relates to some versioning code that went out recently [20:56:22] Now syncing the wmf.5 Vector patch, which looks much safer [20:56:41] I guess there's another patch in wmf4 that's important [20:56:47] I guess this can wait another 2 days worse case. 
[20:57:03] !log catrope@deploy1002 Synchronized php-1.39.0-wmf.5/skins/Vector/skin.json: Backport: [[gerrit:774840|Restore the classes skin-vector and skin-vector-search-vue to body]] (duration: 00m 55s) [20:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:15] Jdlrobson: The stack trace does involve service stuff, see https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-2022.03.29?id=gXV1138BnVSED57uHnVa [20:57:23] Yeah you can just wait for the train to roll forward to wmf.5 [20:57:35] Thankfully the canary system caught this [20:57:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23624 and previous config saved to /var/cache/conftool/dbconfig/20220329-205746-ladsgroup.json [20:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:50] Hurrah! (and sorry for the trouble here) [20:57:57] I can't replicate it locally [20:58:06] The problem is that the files don't arrive at the same time. [20:58:37] Now deploying Juan's patch [20:58:40] In this example the ServiceWiring file wasn't updated yet, resulting in it trying to contruct a SkinVersionLookup, which no longer existed. [20:58:46] Okay [20:59:05] You simply can't sync through such big patches without a spike of errors. [20:59:18] Hah, right. I guess if I deployed the files in the right order it would have worked, and if the deploy was allowed to finish it would have worked. But scap aborted because of canary errors [20:59:26] zabe: i see. [20:59:28] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:773320|Fix I7ce58529cdd320a9500dc215291ef1c369cee9d3: Rearranging restriction levels and add editautopatrolprotected for eliminators. 
(T303579)]] (duration: 00m 56s) [20:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:33] T303579: Create "editautopatrolprotected" protection level for viwiki - https://phabricator.wikimedia.org/T303579 [20:59:48] If I had known this was the case, I could have synced a second time, and it should have gone through I think [20:59:52] But let's not [21:00:12] RoanKattouw: any sense of how bad https://phabricator.wikimedia.org/T302627#7815233 is? [21:00:17] that was the main reason I was backporting [21:01:04] Hah that does look bad [21:01:09] I can give it another shot later [21:01:24] But I really do have to go to the store now so that I can buy things to eat lunch, it's already 2pm [21:02:07] We also have to wait for the very slow CI on Juan's Kashmiri namespaces patch, so I'll deploy that one when I'm back, and then I'll try the Vector patch again [21:02:52] Ok [21:03:47] RoanKattouw: it's been like this a week, so another day won't be a problem. You should eat lunch [21:04:01] I'll also possibly be in the office tomorrow if you wanted to sort it out in person. [21:04:17] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10JMinor) 05Open→03Resolved [21:04:31] Please don't forget another change I left just below the Kashmiri namespace. 
[21:04:41] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [21:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:58] !log phab2002 - rebooting [21:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23625 and previous config saved to /var/cache/conftool/dbconfig/20220329-210848-ladsgroup.json [21:08:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:08:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:08:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23626 and previous config saved to /var/cache/conftool/dbconfig/20220329-210856-ladsgroup.json [21:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:25] !log phab1004 - rebooting [21:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:31] (03Merged) 10jenkins-bot: Revert "End migration mode" [skins/Vector] (wmf/1.39.0-wmf.4) - 10https://gerrit.wikimedia.org/r/774842 (owner: 10Catrope) [21:12:52] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23627 and previous config saved to /var/cache/conftool/dbconfig/20220329-211251-ladsgroup.json [21:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:09] !log planet2002 - rebooting [21:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:34] !log planet1002 - rebooting [21:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:58] !log aphlict1001 - rebooting - this will temp break Phabricator realtime notifications but will be back shortly [21:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:26] !log ryankemper@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs2007.codfw.wmnet [21:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:22:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:26] !log aphlict1001 - manually starting aphlict service after reboot (was needed for some reason) [21:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:21] PROBLEM - Host wdqs1010 is DOWN: PING CRITICAL - Packet loss = 100% [21:25:45] 
(JobUnavailable) firing: Reduced availability for job trafficserver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:26:47] RECOVERY - Host wdqs1010 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [21:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23628 and previous config saved to /var/cache/conftool/dbconfig/20220329-212756-ladsgroup.json [21:27:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [21:28:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [21:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23629 and previous config saved to /var/cache/conftool/dbconfig/20220329-212804-ladsgroup.json [21:28:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [21:30:30] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [21:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:11] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23630 and previous config saved to /var/cache/conftool/dbconfig/20220329-213613-ladsgroup.json [21:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:37:08] Hello [21:38:15] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [21:38:15] PROBLEM - Host wdqs2007 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:55] RECOVERY - Host wdqs2007 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [21:42:39] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:43:01] Juan_90264: hello [21:43:10] jouncebot: now [21:43:10] No deployments scheduled for the next 9 hour(s) and 16 minute(s) [21:43:46] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected 
status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:44:45] Mutante: Roan hasn't finished deploying yet, I'm just waiting for him to come back [21:45:12] Juan_90264: ah:) all good! [21:45:57] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:46:04] yea, he went to lunch but said he can give it another shot later [21:46:45] !log doc2001 - rebooting [21:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:24] !log doc1002 - rebooting [21:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23631 and previous config saved to /var/cache/conftool/dbconfig/20220329-215118-ladsgroup.json [21:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:31] !log cumin1001 systemctl status httpbb_hourly_appserver [21:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:40] !log cumin1001 systemctl start httpbb_hourly_appserver [21:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:51] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot [21:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:14] (CR) Dzahn: "https://www.mediawiki.org/w/index.php?title=Special:CodeReview (/srv/deployment/httpbb-tests/appserver/test_main.yaml:71)" [puppet] - https://gerrit.wikimedia.org/r/774821 (https://phabricator.wikimedia.org/T205361) (owner: Majavah) [22:01:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23632 and previous config saved to 
/var/cache/conftool/dbconfig/20220329-220128-ladsgroup.json [22:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:34] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:03:42] (PS1) Dzahn: httpbb: follow-up to 'fix status code checks for CodeReview redirects' [puppet] - https://gerrit.wikimedia.org/r/774981 (https://phabricator.wikimedia.org/T205361) [22:04:02] (CR) Dzahn: httpbb: fix status code checks for CodeReview redirects (1 comment) [puppet] - https://gerrit.wikimedia.org/r/774821 (https://phabricator.wikimedia.org/T205361) (owner: Majavah) [22:04:18] (PS2) Dzahn: httpbb: follow-up to 'fix status code checks for CodeReview redirects' [puppet] - https://gerrit.wikimedia.org/r/774981 (https://phabricator.wikimedia.org/T205361) [22:05:25] (CR) Dzahn: ""PASS: 116 requests sent to mw1418.eqiad.wmnet. All assertions passed." 
when using this version manually on cumin1001" [puppet] - https://gerrit.wikimedia.org/r/774981 (https://phabricator.wikimedia.org/T205361) (owner: Dzahn) [22:06:02] Hi, I'm back [22:06:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23633 and previous config saved to /var/cache/conftool/dbconfig/20220329-220623-ladsgroup.json [22:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:56] (PS1) Catrope: Update Kashmiri namespace names [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774844 (https://phabricator.wikimedia.org/T304790) [22:07:11] (CR) Catrope: [C: +2] Update Kashmiri namespace names [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774844 (https://phabricator.wikimedia.org/T304790) (owner: Catrope) [22:07:24] (PS1) Catrope: Update Kashmiri namespace names [core] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/774845 (https://phabricator.wikimedia.org/T304790) [22:07:34] (CR) Catrope: [C: +2] Update Kashmiri namespace names [core] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/774845 (https://phabricator.wikimedia.org/T304790) (owner: Catrope) [22:07:47] RoanKattouw: wb. Juan was waiting for you but you missed him by 5 minutes [22:07:57] mutante: It'll probably take 20 minutes for these patches to go through CI before I can deploy them. Were you planning to restart anything today? [22:08:06] Oh I see he left :( [22:08:57] OK then I'll cancel and reschedule these for tomorrow [22:09:24] RoanKattouw: I am rebooting things but will not touch contint if that's what you meant [22:09:38] Oh, no, I was wondering if you were going to touch any deployment-related hosts [22:10:15] hmm.. I wasn't immediately planning it..no.. 
But I also do see now they should be [22:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:12:12] RoanKattouw: not sure if it's the good or the bad time to reboot when fewer people around :) [22:12:37] (CR) Ebernhardson: [C: +2] team-search-platform: add jvmquake alerting [alerts] - https://gerrit.wikimedia.org/r/773758 (https://phabricator.wikimedia.org/T293862) (owner: DCausse) [22:14:48] !log doc1001 - rebooting (doc.wikimedia.org) [22:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P23634 and previous config saved to /var/cache/conftool/dbconfig/20220329-221634-ladsgroup.json [22:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23635 and previous config saved to /var/cache/conftool/dbconfig/20220329-222128-ladsgroup.json [22:21:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [22:21:31] (PS2) Ebernhardson: team-search-platform: add jvmquake alerting [alerts] - https://gerrit.wikimedia.org/r/773758 (https://phabricator.wikimedia.org/T293862) (owner: DCausse) [22:21:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [22:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:21:34] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:21:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23636 and previous config saved to /var/cache/conftool/dbconfig/20220329-222141-ladsgroup.json [22:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:55] (CR) Ebernhardson: [C: +2] team-search-platform: add jvmquake alerting [alerts] - https://gerrit.wikimedia.org/r/773758 (https://phabricator.wikimedia.org/T293862) (owner: DCausse) [22:26:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23637 and previous config saved to /var/cache/conftool/dbconfig/20220329-222650-ladsgroup.json [22:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:56] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, 
user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:28:04] (Merged) jenkins-bot: team-search-platform: add jvmquake alerting [alerts] - https://gerrit.wikimedia.org/r/773758 (https://phabricator.wikimedia.org/T293862) (owner: DCausse) [22:29:41] (CR) Cwhite: [C: +1] "It's good enough until we arrive at a better solution." [puppet] - https://gerrit.wikimedia.org/r/774364 (https://phabricator.wikimedia.org/T304924) (owner: Filippo Giunchedi) [22:30:11] !log moscovium (rt.wikimedia.org) - rebooting [22:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:35] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [22:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P23638 and previous config saved to /var/cache/conftool/dbconfig/20220329-223139-ladsgroup.json [22:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:35] !log mwdebug2002 - rebooting [22:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:09] !log mwdebug2001 - rebooting [22:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [22:39:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [22:39:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 8 hosts with reason: Maintenance [22:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[22:39:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 8 hosts with reason: Maintenance [22:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:01] (BlazegraphJvmQuakeWarnGC) firing: (3) Blazegraph instance wdqs2001:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [22:41:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23639 and previous config saved to /var/cache/conftool/dbconfig/20220329-224155-ladsgroup.json [22:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:01] (BlazegraphJvmQuakeWarnGC) firing: (7) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [22:46:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23640 and previous config saved to /var/cache/conftool/dbconfig/20220329-224644-ladsgroup.json [22:46:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [22:46:47] (CR) Dzahn: [C: +2] "I am not claiming I know why it ends at "https://www.mediawiki.org/wiki/Special:Code" but this is how the test passes." 
[puppet] - https://gerrit.wikimedia.org/r/774981 (https://phabricator.wikimedia.org/T205361) (owner: Dzahn) [22:46:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [22:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:46:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23641 and previous config saved to /var/cache/conftool/dbconfig/20220329-224652-ladsgroup.json [22:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:13] (CR) Dzahn: "systemctl status httpbb_hourly_appserver" [puppet] - https://gerrit.wikimedia.org/r/774981 (https://phabricator.wikimedia.org/T205361) (owner: Dzahn) [22:49:51] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:25] !log cumin1001 - systemctl start httpbb_hourly_appserver fixed Icinga alert after gerrit:774981 T205361 [22:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:31] T205361: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 [22:50:40] (CR) Dzahn: "22:49 <+icinga-wm> RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikim" [puppet] - https://gerrit.wikimedia.org/r/774981 (https://phabricator.wikimedia.org/T205361) (owner: Dzahn) [22:56:15] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:57:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23642 and previous config saved to /var/cache/conftool/dbconfig/20220329-225700-ladsgroup.json [22:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:08] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/34617/gitlab1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [23:00:01] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:00:36] (CR) Dzahn: "noop on gitlab1001 confirmed" [puppet] - https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [23:00:59] (PS3) Dzahn: gitlab: run backup and restore twice daily [puppet] - https://gerrit.wikimedia.org/r/774418 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [23:03:19] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/34618/gitlab1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/774418 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [23:05:56] RoanKattouw Hello? [23:06:17] 22:07 < mutante> RoanKattouw: wb. Juan was waiting for you but you missed him by 5 minutes [23:06:20] 22:07 < RoanKattouw> mutante: It'll probably take 20 minutes for these patches to go through CI before I can deploy them. 
Were you planning to restart anything today? [23:06:23] 22:08 < RoanKattouw> Oh I see he left :( [23:06:26] 22:08 < RoanKattouw> OK then I'll cancel and reschedule these for tomorrow [23:06:29] Juan_90264: ^ [23:06:31] RoanKattouw: ^ [23:07:12] Oh hello! Welcome back! [23:07:24] Sorry for being away for so long [23:07:36] I can deploy the Kashmiri aliases now if you like [23:10:45] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:12:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23643 and previous config saved to /var/cache/conftool/dbconfig/20220329-231205-ladsgroup.json [23:12:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [23:12:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [23:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:12:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:12:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[23:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [23:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [23:12:26] (CR) Dzahn: "BEFORE:" [puppet] - https://gerrit.wikimedia.org/r/774418 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [23:12:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [23:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [23:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [23:12:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [23:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23644 and previous config saved to /var/cache/conftool/dbconfig/20220329-231248-ladsgroup.json [23:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:10] 
(CR) Dzahn: "unrelated to this change, but is "Unit partial-backup.timer could not be found." to be expected?" [puppet] - https://gerrit.wikimedia.org/r/774418 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [23:14:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23645 and previous config saved to /var/cache/conftool/dbconfig/20220329-231456-ladsgroup.json [23:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:40] Okay [23:15:45] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:16:45] RoanKattouw: Let's deploy? [23:17:23] (CR) Catrope: [C: +2] Update Kashmiri namespace names [core] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/774845 (https://phabricator.wikimedia.org/T304790) (owner: Catrope) [23:17:26] (CR) Catrope: [C: +2] Update Kashmiri namespace names [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774844 (https://phabricator.wikimedia.org/T304790) (owner: Catrope) [23:17:49] Alright, I've +2ed the cherry-picks. 
It'll take a while for these to be merged, usually 15-20 minutes [23:18:33] Okay [23:30:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23646 and previous config saved to /var/cache/conftool/dbconfig/20220329-233001-ladsgroup.json [23:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:31] (Merged) jenkins-bot: Update Kashmiri namespace names [core] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/774845 (https://phabricator.wikimedia.org/T304790) (owner: Catrope) [23:34:54] (Merged) jenkins-bot: Update Kashmiri namespace names [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774844 (https://phabricator.wikimedia.org/T304790) (owner: Catrope) [23:37:12] Juan_90264: Your patches are now ready for testing on mwdebug1002, please test [23:37:43] Yes, I Will test [23:39:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23647 and previous config saved to /var/cache/conftool/dbconfig/20220329-234000-ladsgroup.json [23:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:40:07] (PS1) Catrope: Revert "Revert "End migration mode"" [skins/Vector] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774986 [23:40:12] (CR) Catrope: [C: +2] Revert "Revert "End migration mode"" 
[skins/Vector] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774986 (owner: Catrope) [23:40:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:40:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23648 and previous config saved to /var/cache/conftool/dbconfig/20220329-234506-ladsgroup.json [23:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:14] How strange... I don't see any change in namespaces using mwdebug1002 or mwdebug1001 [23:55:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P23649 and previous config saved to /var/cache/conftool/dbconfig/20220329-235505-ladsgroup.json [23:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:12] (Merged) jenkins-bot: Revert "Revert "End migration mode"" [skins/Vector] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774986 (owner: Catrope) [23:59:24] RoanKattouw: I'm testing, but the change is not showing up with mwdebug1002 [23:59:56] Juan_90264: Hmm, maybe this is one of these changes that's untestable with mwdebug1002