[00:00:05] RoanKattouw and Urbanecm: Your horoscope predicts another unfortunate UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T0000).
[00:00:05] No Gerrit patches in the queue for this window AFAICS.
[00:12:02] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: product-analytics-movement-metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:18] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/761064 (https://phabricator.wikimedia.org/T299147) (owner: Herron)
[00:24:06] PROBLEM - Check systemd state on elastic1035 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:20] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:36:06] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:41:24] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:48:52] (PS1) Ebernhardson: [WIP] Remove blazegraph_options parameter [puppet] -
https://gerrit.wikimedia.org/r/761080
[00:49:44] (CR) Ebernhardson: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/761080 (owner: Ebernhardson)
[00:53:02] (PS2) Ebernhardson: [WIP] Remove blazegraph_options parameter [puppet] - https://gerrit.wikimedia.org/r/761080
[00:54:22] (CR) Ebernhardson: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/761080 (owner: Ebernhardson)
[00:54:55] (CR) jerkins-bot: [V: -1] [WIP] Remove blazegraph_options parameter [puppet] - https://gerrit.wikimedia.org/r/761080 (owner: Ebernhardson)
[01:05:47] (PS3) Ebernhardson: [WIP] Remove blazegraph_options parameter [puppet] - https://gerrit.wikimedia.org/r/761080
[01:05:49] (CR) Ebernhardson: "I wonder if instead of this we should do something more like Iae729082fb1d9e06. Drop the OAUTH_* environment vars alltogether and provide " [puppet] - https://gerrit.wikimedia.org/r/757996 (owner: Ebernhardson)
[01:29:40] (PS1) Ladsgroup: Revert "db1182: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/761086
[01:34:44] (PS2) Ladsgroup: Revert "db1182: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/761086
[01:35:21] (CR) Ladsgroup: [V: +2 C: +2] Revert "db1182: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/761086 (owner: Ladsgroup)
[01:37:00] RECOVERY - SSH on wtp1027.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:11:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[02:11:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[02:11:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 11 hosts with reason: Maintenance
[02:11:23] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 11 hosts with reason: Maintenance
[02:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:42] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:34:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[02:34:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[02:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298554)', diff saved to https://phabricator.wikimedia.org/P20403 and previous config saved to /var/cache/conftool/dbconfig/20220209-023446-ladsgroup.json
[02:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:51] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[02:35:12] (PS11) Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) [puppet] - https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805)
[02:58:21] PROBLEM - Check systemd state on elastic1040 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:46] !log ladsgroup@cumin1001 dbctl commit (dc=all):
'Repooling after maintenance db1147 (T298554)', diff saved to https://phabricator.wikimedia.org/P20404 and previous config saved to /var/cache/conftool/dbconfig/20220209-030245-ladsgroup.json
[03:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:51] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[03:17:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P20405 and previous config saved to /var/cache/conftool/dbconfig/20220209-031750-ladsgroup.json
[03:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:22:01] (PS2) BBlack: Add drmrs to smokeping config [puppet] - https://gerrit.wikimedia.org/r/760616 (https://phabricator.wikimedia.org/T282788)
[03:32:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P20406 and previous config saved to /var/cache/conftool/dbconfig/20220209-033255-ladsgroup.json
[03:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:48:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298554)', diff saved to https://phabricator.wikimedia.org/P20407 and previous config saved to /var/cache/conftool/dbconfig/20220209-034800-ladsgroup.json
[03:48:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[03:48:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[03:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:48:05] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[03:48:08] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:10:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[04:10:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[04:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:52:46] (CR) Krinkle: [C: +1] "Good to deploy." [mediawiki-config] - https://gerrit.wikimedia.org/r/760640 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[04:53:13] (PS2) Krinkle: Start writing to $wmgConfigDir the same value as to $wmfConfigDir [mediawiki-config] - https://gerrit.wikimedia.org/r/760570 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[04:54:00] (CR) Krinkle: "Rebased on I074a2e5a6f to increase confidence in the review since the diff context was master on master and may end up different after aut" [mediawiki-config] - https://gerrit.wikimedia.org/r/760570 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[04:54:54] (CR) Krinkle: Start writing to $wmgConfigDir the same value as to $wmfConfigDir (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/760570 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[05:16:51] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:20:41] (PS1) Marostegui: db2076,db2095: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/761226 (https://phabricator.wikimedia.org/T301313)
[06:22:18] (CR) Marostegui: [C: +2] db2076,db2095: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/761226
(https://phabricator.wikimedia.org/T301313) (owner: Marostegui)
[07:14:30] PROBLEM - SSH on bast4003 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:16:30] RECOVERY - SSH on bast4003 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:18:52] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:35:16] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:35:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove logpager group from s1 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P20410 and previous config saved to /var/cache/conftool/dbconfig/20220209-073528-marostegui.json
[07:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:35] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[07:36:27] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33634/console" [puppet] - https://gerrit.wikimedia.org/r/760956 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[07:37:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:39:34] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 271 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:40:44] (CR) Elukey: [V: +1 C: +2] profile::kubernetes::node: support bullseye [puppet] - https://gerrit.wikimedia.org/r/760956 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[07:41:48] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:42:48] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet
[07:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:21] (CR) Filippo Giunchedi: watchrat: route alerts to irc and noc@ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/761064 (https://phabricator.wikimedia.org/T299147) (owner: Herron)
[07:46:49] (CR) Filippo Giunchedi: [C: +1] grafana: add configurable execute_alerts option [puppet] - https://gerrit.wikimedia.org/r/761026 (https://phabricator.wikimedia.org/T300997) (owner: Cwhite)
[07:49:02] (PS1) Alexandros Kosiaris: Update for bullseye [debs/prometheus-etherpad-exporter] - https://gerrit.wikimedia.org/r/761273
[07:50:46] (CR) Filippo Giunchedi: graphite: whisper_cleanup: migrate cron to systemd timer job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:51:21] (PS2) Alexandros Kosiaris: Update for bullseye [debs/prometheus-etherpad-exporter] - https://gerrit.wikimedia.org/r/761273
[07:55:05] (PS1) Elukey: kubernetes: add bullseye apt configuration [puppet] - https://gerrit.wikimedia.org/r/761275 (https://phabricator.wikimedia.org/T300744)
[07:55:28] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet
[07:55:31] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:03] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33635/console" [puppet] - https://gerrit.wikimedia.org/r/761275 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:09:50] (PS1) Filippo Giunchedi: prometheus: add 'tcp' probe type [puppet] - https://gerrit.wikimedia.org/r/761279 (https://phabricator.wikimedia.org/T291946)
[08:18:17] (PS3) Alexandros Kosiaris: Update for bullseye [debs/prometheus-etherpad-exporter] - https://gerrit.wikimedia.org/r/761273
[08:21:15] !log restarting blazegraph on wdqs1004 (jvm stuck for 5hours)
[08:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:48] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1004 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:25:12] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:29:47] (PS4) Alexandros Kosiaris: Update for bullseye [debs/prometheus-etherpad-exporter] - https://gerrit.wikimedia.org/r/761273
[08:30:12] (CR) DCausse: [C: +1] rdf-query-service: consistently suffix env vars (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757996 (owner: Ebernhardson)
[08:33:02] (PS2) Filippo Giunchedi: prometheus: add 'tcp' probe type [puppet] - https://gerrit.wikimedia.org/r/761279 (https://phabricator.wikimedia.org/T291946)
[08:33:21] (CR) Muehlenhoff: Update for bullseye (3 comments) [debs/prometheus-etherpad-exporter] - https://gerrit.wikimedia.org/r/761273 (owner: Alexandros Kosiaris)
[08:34:50] (PS5) Alexandros Kosiaris: Update for bullseye [debs/prometheus-etherpad-exporter] -
https://gerrit.wikimedia.org/r/761273 (https://phabricator.wikimedia.org/T300568)
[08:48:40] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33637/console" [puppet] - https://gerrit.wikimedia.org/r/761275 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:49:26] (CR) Elukey: [C: +1] Bump resources of cert-manager and components [deployment-charts] - https://gerrit.wikimedia.org/r/761000 (owner: JMeybohm)
[08:51:05] (CR) Elukey: [C: +2] Update elukey's ssh public key [homer/public] - https://gerrit.wikimedia.org/r/760937 (owner: Elukey)
[08:51:33] (CR) Alexandros Kosiaris: Update for bullseye (3 comments) [debs/prometheus-etherpad-exporter] - https://gerrit.wikimedia.org/r/761273 (https://phabricator.wikimedia.org/T300568) (owner: Alexandros Kosiaris)
[08:51:41] (PS6) Alexandros Kosiaris: Update for bullseye [debs/prometheus-etherpad-exporter] - https://gerrit.wikimedia.org/r/761273 (https://phabricator.wikimedia.org/T300568)
[08:52:29] (PS7) Alexandros Kosiaris: Update for bullseye [debs/prometheus-etherpad-exporter] - https://gerrit.wikimedia.org/r/761273 (https://phabricator.wikimedia.org/T300568)
[08:53:28] (CR) Elukey: "Self answer: "Must follow SSHv2 or SSHv1 RSA key format"" [homer/public] - https://gerrit.wikimedia.org/r/760937 (owner: Elukey)
[08:53:40] (CR) Volans: [C: +1] "LGTM, minor nits inline" [puppet] - https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) (owner: Jbond)
[08:55:24] elukey: yeah that won't work on some devices
[08:55:48] (CR) Alexandros Kosiaris: [C: +1] "Many thanks for this! We 've been planning on doing this for a bit now, it's about time. +1 from me on premise.
I 've also had a quick loo" [puppet] - https://gerrit.wikimedia.org/r/761275 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:56:16] (PS1) Elukey: Change elukey's ssh public key [homer/public] - https://gerrit.wikimedia.org/r/761281
[08:56:40] (CR) Volans: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/761281 (owner: Elukey)
[08:57:22] volans: <3
[08:57:46] (CR) Elukey: [C: +2] Change elukey's ssh public key [homer/public] - https://gerrit.wikimedia.org/r/761281 (owner: Elukey)
[08:57:59] volans: I guess that I'd need to run homer on all devices right?
[08:58:21] I can test on one first of course
[08:58:57] yes and sure
[08:59:21] elukey: *but* new eqiad expansion ones might not yet be fully homerized
[08:59:50] so I suggest you say "ok"only to those that have as diff only your key and nothing else
[09:01:29] (CR) Muehlenhoff: [C: +1] "Looks good!" [debs/prometheus-etherpad-exporter] - https://gerrit.wikimedia.org/r/761273 (https://phabricator.wikimedia.org/T300568) (owner: Alexandros Kosiaris)
[09:01:45] volans: I'd love to apply uncommitted changes to random devices, you are a party breaker
[09:02:09] yeah because you'll revert them :-P
[09:02:36] so you'll be the party breaker :D
[09:04:10] SRE, Traffic, Patch-For-Review, Performance-Team (Radar), User-ema: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (ema) Although we did briefly discuss the results of this experiment within #traffic, I don't think we ever publicly disclosed our analysis....
[09:06:21] SRE, Observability-Alerting, User-fgiunchedi: Debug / fine tune puppet failed metrics and alerts on alert* hosts - https://phabricator.wikimedia.org/T299628 (fgiunchedi) Open→Resolved a:fgiunchedi Tentatively resolving
[09:09:39] (PS2) Filippo Giunchedi: team-sre: port job unavailable alert [alerts] - https://gerrit.wikimedia.org/r/744033 (https://phabricator.wikimedia.org/T288726)
[09:11:46] (CR) jerkins-bot: [V: -1] team-sre: port job unavailable alert [alerts] - https://gerrit.wikimedia.org/r/744033 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[09:13:04] (CR) Ema: [V: +1 C: +2] Revert "ATS: lower number of allowed Lua states on cp3050" [puppet] - https://gerrit.wikimedia.org/r/760889 (https://phabricator.wikimedia.org/T265625) (owner: Ema)
[09:13:51] (PS3) Filippo Giunchedi: team-sre: port job unavailable alert [alerts] - https://gerrit.wikimedia.org/r/744033 (https://phabricator.wikimedia.org/T288726)
[09:15:51] !log cp3050: ats-backend-restart to set the number of allowed Lua states back from 64 to 256 (default) T265625
[09:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:56] T265625: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625
[09:16:45] (PS2) Filippo Giunchedi: prometheus: remove job unavailable alert [puppet] - https://gerrit.wikimedia.org/r/744035 (https://phabricator.wikimedia.org/T288726)
[09:16:51] (CR) Filippo Giunchedi: [C: +2] team-sre: port job unavailable alert [alerts] - https://gerrit.wikimedia.org/r/744033 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[09:16:59] (PS4) Filippo Giunchedi: team-sre: port job unavailable alert [alerts] - https://gerrit.wikimedia.org/r/744033 (https://phabricator.wikimedia.org/T288726)
[09:18:51] (CR) jerkins-bot: [V: -1] prometheus: remove job unavailable alert [puppet] -
https://gerrit.wikimedia.org/r/744035 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[09:19:24] (PS1) MVernon: swift: remove ms-fe200[5-8] [puppet] - https://gerrit.wikimedia.org/r/761283 (https://phabricator.wikimedia.org/T301251)
[09:23:27] (CR) Filippo Giunchedi: "recheck" [puppet] - https://gerrit.wikimedia.org/r/744035 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[09:24:32] (CR) Alexandros Kosiaris: [C: +2] "Thanks!" [debs/prometheus-etherpad-exporter] - https://gerrit.wikimedia.org/r/761273 (https://phabricator.wikimedia.org/T300568) (owner: Alexandros Kosiaris)
[09:26:25] SRE-swift-storage, User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (fgiunchedi) And we're good now: ` thanos-be2001:~$ ls -la /dev/disk/by-path/ | grep -v part | sort -k11 drwxr-xr-x 2 root root 720 Feb 9 07:51 . drwxr-xr-x 8 root root 160 Feb 9 07:51 .. tot...
[09:26:37] (CR) Filippo Giunchedi: [C: +2] prometheus: remove job unavailable alert [puppet] - https://gerrit.wikimedia.org/r/744035 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[09:27:22] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missi
[09:27:22] [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[09:30:04] (CR) JMeybohm: [C: +2] Bump resources of cert-manager and components [deployment-charts] - https://gerrit.wikimedia.org/r/761000 (owner: JMeybohm)
[09:31:05] (PS2) Ema: Revert "cache: test atskafka webrequest on cp3050" [puppet] -
https://gerrit.wikimedia.org/r/760895 (https://phabricator.wikimedia.org/T247497)
[09:33:04] (CR) Ema: [C: +2] Revert "cache: test atskafka webrequest on cp3050" [puppet] - https://gerrit.wikimedia.org/r/760895 (https://phabricator.wikimedia.org/T247497) (owner: Ema)
[09:33:50] (CR) Filippo Giunchedi: [C: +1] swift: remove ms-fe200[5-8] [puppet] - https://gerrit.wikimedia.org/r/761283 (https://phabricator.wikimedia.org/T301251) (owner: MVernon)
[09:34:03] (Merged) jenkins-bot: Bump resources of cert-manager and components [deployment-charts] - https://gerrit.wikimedia.org/r/761000 (owner: JMeybohm)
[09:38:32] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 275 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:39:45] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:40:03] (PS1) Volans: CHANGELOG: add changelogs for release v1.0.1 [software/pywmflib] - https://gerrit.wikimedia.org/r/761284
[09:41:22] !log cp3050: stop and disable atskafka-webrequest.service T247497
[09:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:27] T247497: Test atskafka deployment - https://phabricator.wikimedia.org/T247497
[09:41:48] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v1.0.1 [software/pywmflib] - https://gerrit.wikimedia.org/r/761284 (owner: Volans)
[09:43:46] (CR) JMeybohm: [C: +1] "As said on IRC I'd like us to get rid of `packages_from_future` (maybe in favor of passing the apt component to the classes for multi-vers" [puppet] - https://gerrit.wikimedia.org/r/761275
(https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[09:44:13] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v1.0.1 [software/pywmflib] - https://gerrit.wikimedia.org/r/761284 (owner: Volans)
[09:44:28] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:28] (CR) MVernon: [C: +2] swift: remove ms-fe200[5-8] [puppet] - https://gerrit.wikimedia.org/r/761283 (https://phabricator.wikimedia.org/T301251) (owner: MVernon)
[09:45:38] !log update my ssh key on all network devices (will commit only when the diff is my key only)
[09:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:44] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:54] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:46] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:10] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:48:32] PROBLEM - Check systemd state on snapshot1012 is CRITICAL: CRITICAL - degraded: The following units failed: fulldumps-rest.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:49:45] (JobUnavailable) firing: (6) Reduced availability for job atskafka in esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:51:19] (PS1) Filippo Giunchedi: alertmanager: inhibit warnings when a match critical alert is firing [puppet] - https://gerrit.wikimedia.org/r/761285
[09:55:09] the atskafka alerts are expected, ema is removing it from cp3050
[09:55:37] indeed
[09:56:12] (CR) Elukey: [V: +1 C: +2] kubernetes: add bullseye apt configuration [puppet] - https://gerrit.wikimedia.org/r/761275 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[10:01:15] (PS1) ArielGlenn: fix up siteinfov2 job entry in api jobs table [dumps] - https://gerrit.wikimedia.org/r/761287
[10:01:54] (PS1) Ema: prometheus::ops: remove atskafka job definition [puppet] - https://gerrit.wikimedia.org/r/761288 (https://phabricator.wikimedia.org/T247497)
[10:02:15] !log rolling restart of swift frontends T301251
[10:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:20] T301251: Decommission ms-fe200[5-8] - https://phabricator.wikimedia.org/T301251
[10:02:32] (CR) Elukey: [C: +1] prometheus::ops: remove atskafka job definition [puppet] - https://gerrit.wikimedia.org/r/761288
(https://phabricator.wikimedia.org/T247497) (owner: Ema)
[10:03:47] (CR) JMeybohm: [C: +2] Enable nodePort 30021 for ingressgateway status [deployment-charts] - https://gerrit.wikimedia.org/r/759726 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[10:03:50] (CR) JMeybohm: [C: +2] Add ingress.staging switch [deployment-charts] - https://gerrit.wikimedia.org/r/759727 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[10:03:54] (CR) JMeybohm: [C: +2] Add ingress support to miscweb chart [deployment-charts] - https://gerrit.wikimedia.org/r/757935 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[10:03:58] !log T300568 upload prometheus-etherpad-exporter_0.4_amd64 to apt.wikimedia.org bullseye-wikimedia/main
[10:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:02] T300568: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568
[10:06:49] (PS1) Volans: Upstream release v1.0.1 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/761291
[10:07:37] (Merged) jenkins-bot: Enable nodePort 30021 for ingressgateway status [deployment-charts] - https://gerrit.wikimedia.org/r/759726 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[10:07:40] (CR) Volans: [C: +2] Upstream release v1.0.1 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/761291 (owner: Volans)
[10:07:42] (Merged) jenkins-bot: Add ingress.staging switch [deployment-charts] - https://gerrit.wikimedia.org/r/759727 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[10:07:44] (Merged) jenkins-bot: Add ingress support to miscweb chart [deployment-charts] - https://gerrit.wikimedia.org/r/757935 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[10:09:58] (Merged) jenkins-bot: Upstream release v1.0.1 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/761291
(owner: Volans)
[10:10:54] SRE, SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (LSobanski) Is gitlab-roots actually needed? Maybe the group does something else than the name would suggest but I thought root access on Gitlab hosts was restricted to SRE?
[10:11:14] (PS3) JMeybohm: miscweb: Remove repeating settings and enable ingress [deployment-charts] - https://gerrit.wikimedia.org/r/757936 (https://phabricator.wikimedia.org/T290966)
[10:11:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts ms-fe[2005-2008].codfw.wmnet
[10:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:27] (CR) ArielGlenn: [C: +2] fix up siteinfov2 job entry in api jobs table [dumps] - https://gerrit.wikimedia.org/r/761287 (owner: ArielGlenn)
[10:14:44] !log uploaded python3-wmflib_1.0.1 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia
[10:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:52] (Merged) jenkins-bot: fix up siteinfov2 job entry in api jobs table [dumps] - https://gerrit.wikimedia.org/r/761287 (owner: ArielGlenn)
[10:15:13] (PS1) MVernon: hieradata: move codfs swift::stats_reporter_host [puppet] - https://gerrit.wikimedia.org/r/761293 (https://phabricator.wikimedia.org/T301251)
[10:15:16] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts ms-fe[2005-2008].codfw.wmnet
[10:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:55] (CR) MVernon: "...there's always one more thing!"
[puppet] - 10https://gerrit.wikimedia.org/r/761293 (https://phabricator.wikimedia.org/T301251) (owner: 10MVernon) [10:15:58] !log ariel@deploy1002 Started deploy [dumps/dumps@9993036]: fix up default api jobs entry for siteinfo v2 [10:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:01] !log ariel@deploy1002 Finished deploy [dumps/dumps@9993036]: fix up default api jobs entry for siteinfo v2 (duration: 00m 03s) [10:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:46] !log update scap to 4.3.1 on A:mw-canary or A:parsoid-canary or A:mw-jobrunner-canary - T301307 [10:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:51] T301307: Deploy Scap version 4.3.1 - https://phabricator.wikimedia.org/T301307 [10:17:51] (03CR) 10JMeybohm: [C: 03+2] miscweb: Remove repeating settings and enable ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/757936 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [10:20:43] !log update scap to 4.3.1 on A:restbase-canary - T301307 [10:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:46] (03Merged) 10jenkins-bot: miscweb: Remove repeating settings and enable ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/757936 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [10:22:02] PROBLEM - Check systemd state on ms-be2044 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:15] !log jelto@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) [10:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:27] (03Abandoned) 10Jbond: DO NOT MERGE: example ci to demonstrate possible securecommand [software/cumin] - 10https://gerrit.wikimedia.org/r/760635 (owner: 10Jbond) [10:25:37] 
!log jelto@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 00m 22s) [10:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:45] (03PS1) 10Filippo Giunchedi: hieradata: move prometheus_nodes to WMCS role-based hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) [10:27:50] (03CR) 10DCausse: [C: 04-1] "putting a -1 to prevent merging this accidentally as I believe the runBlazegraph.sh script is not properly using these vars" [puppet] - 10https://gerrit.wikimedia.org/r/757996 (owner: 10Ebernhardson) [10:28:39] (03CR) 10Volans: "I have little context to say with certainty if this change is enough to move it, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/761293 (https://phabricator.wikimedia.org/T301251) (owner: 10MVernon) [10:29:35] (03CR) 10Filippo Giunchedi: [C: 03+1] hieradata: move codfs swift::stats_reporter_host [puppet] - 10https://gerrit.wikimedia.org/r/761293 (https://phabricator.wikimedia.org/T301251) (owner: 10MVernon) [10:29:58] (03CR) 10MVernon: [C: 03+2] hieradata: move codfs swift::stats_reporter_host [puppet] - 10https://gerrit.wikimedia.org/r/761293 (https://phabricator.wikimedia.org/T301251) (owner: 10MVernon) [10:30:22] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10jbond) > I was also able to confirm that this problem is specific to nginx. My proposal is to switch our mirror to apache2 until the bug is resolved. +1 i think the trend in the... 
[10:30:53] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33640/console" [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [10:32:40] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply on main [10:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:19] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: sync on main [10:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:38] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply on main [10:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:46] (03PS2) 10Filippo Giunchedi: hieradata: move prometheus_nodes to WMCS role-based hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) [10:34:58] (03PS1) 10Giuseppe Lavagetto: draft kube [software/spicerack] - 10https://gerrit.wikimedia.org/r/761296 [10:35:00] (03PS1) 10Giuseppe Lavagetto: k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) [10:35:18] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2044 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:35:23] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: sync on main [10:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:53] (03PS2) 10Giuseppe Lavagetto: k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) [10:36:12] (03Abandoned) 10Giuseppe Lavagetto: draft kube 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/761296 (owner: 10Giuseppe Lavagetto) [10:36:25] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33641/console" [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [10:41:38] (03PS1) 10Alexandros Kosiaris: Switch to python3's urllib [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/761299 [10:41:51] (03PS10) 10JMeybohm: Add LVS service k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) [10:42:06] (03CR) 10jerkins-bot: [V: 04-1] k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [10:42:48] (03CR) 10jerkins-bot: [V: 04-1] k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [10:44:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch to python3's urllib [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/761299 (owner: 10Alexandros Kosiaris) [10:45:28] !log T300568 upload prometheus-etherpad-exporter_0.5_amd64 to apt.wikimedia.org bullseye-wikimedia/main [10:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:32] T300568: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 [10:46:20] RECOVERY - Check systemd state on ms-be2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:28] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 254 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:50:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts ms-fe[2005-2008].codfw.wmnet [10:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:06] PROBLEM - Number of mw swift objects in eqiad greater than codfw on alert1001 is CRITICAL: execution: found duplicate series for the match group {account=mw-media, class=deleted} on the right hand-side of the operation: [{__name__=swift_container_stats_objects_total, account=mw-media, class=deleted, cluster=swift, instance=ms-fe2009:9112, job=statsd_exporter, site=codfw}, {__name__=swift_container_stats_objects_total, account=mw-media, cl [10:52:06] ted, cluster=swift, instance=ms-fe2005:9112, job=statsd_exporter, site=codfw}]:many-to-many matching not allowed: matching labels must be unique on one side https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [10:52:28] PROBLEM - Number of mw swift objects in codfw greater than eqiad on alert1001 is CRITICAL: execution: multiple matches for labels: many-to-one matching must be explicit (group_left/group_right) https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=codfw [10:54:22] interesting, I'm assuming that's because of the flip of stat reporting host, should be recovering by itself I believe cc Emperor [10:54:44] (03CR) 10Jbond: gitlab_runner: execute gitlab-runner as non-root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:56:27] godog: I hope so! I did check the old node wasn't running stats any more and the new node looked to be doing so [10:56:58] RECOVERY - Number of mw swift objects in eqiad greater than codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [10:57:00] but I don't really grok what that alert is complaining about [10:57:00] Emperor: ack, yeah that makes sense [10:57:02] there we go [10:57:07] \o/ [10:57:08] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:57:18] RECOVERY - Number of mw swift objects in codfw greater than eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=codfw [10:58:38] (03PS1) 10Volans: spcierack: adapt type hint to latest wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/761302 [10:59:37] godog: I expect we'll get the same when we do similar to the eqiad setup in due course [11:01:36] Emperor: yeah [11:01:40] I'm expecting the same [11:02:42] PROBLEM - Ganeti memory on ganeti1022 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (255179) = 12.7% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [11:02:45] only mildly alarming :) [11:06:27] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33642/console" [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [11:06:44] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2044 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:07:48] (03CR) 10jerkins-bot: [V: 04-1] spcierack: adapt type hint to latest wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/761302 (owner: 10Volans) [11:08:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-fe[2005-2008].codfw.wmnet [11:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:26] 10SRE-swift-storage, 10Patch-For-Review: Decommission ms-fe200[5-8] - https://phabricator.wikimedia.org/T301251 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `ms-fe[2005-2008].codfw.wmnet` - ms-fe2005.codfw.wmnet (**PASS**) - Downtimed host on Icinga - Fou... 
[11:09:15] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: nginx listen on IPv6, refactor variables [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [11:12:02] (03PS2) 10Volans: spcierack: adapt type hint to latest wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/761302 [11:12:15] (03CR) 10MMandere: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/760613 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [11:16:38] (03PS8) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 [11:17:54] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:19:41] (03PS9) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 [11:19:43] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated-tests: make sure no webservices are left started [puppet] - 10https://gerrit.wikimedia.org/r/761304 [11:20:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance [11:20:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance [11:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T300775)', diff saved to 
https://phabricator.wikimedia.org/P20411 and previous config saved to /var/cache/conftool/dbconfig/20220209-112029-marostegui.json [11:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:34] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [11:22:50] (03CR) 10Jbond: [C: 03+1] spcierack: adapt type hint to latest wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/761302 (owner: 10Volans) [11:23:43] (03PS3) 10Zabe: Start writing to $wmgConfigDir the same value as to $wmfConfigDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760570 (https://phabricator.wikimedia.org/T45956) [11:24:09] (03CR) 10Volans: [C: 03+2] spcierack: adapt type hint to latest wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/761302 (owner: 10Volans) [11:25:12] (03CR) 10Zabe: Start writing to $wmgConfigDir the same value as to $wmfConfigDir (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760570 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [11:31:14] 10SRE-swift-storage, 10ops-codfw, 10decommission-hardware: Decommission ms-fe200[5-8].codfw.wmnet - https://phabricator.wikimedia.org/T301334 (10MatthewVernon) [11:32:29] (03PS5) 10Zabe: graphite: whisper_cleanup: migrate cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) [11:32:52] 10SRE-swift-storage, 10Patch-For-Review: Decommission ms-fe200[5-8] - https://phabricator.wikimedia.org/T301251 (10MatthewVernon) 05Open→03Resolved This is now done - T301334 is the DC-team task to actually decommission the hardware. 
[11:33:29] (03Merged) 10jenkins-bot: spcierack: adapt type hint to latest wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/761302 (owner: 10Volans) [11:37:23] (03PS6) 10Zabe: graphite: whisper_cleanup: migrate cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) [11:38:10] (03CR) 10jerkins-bot: [V: 04-1] graphite: whisper_cleanup: migrate cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:38:43] (03PS7) 10Zabe: graphite: whisper_cleanup: migrate cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) [11:39:04] (03PS8) 10Zabe: graphite: whisper_cleanup: migrate cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) [11:41:17] 10SRE, 10serviceops, 10GitLab (Infrastructure): gitlab: enable IPv6 for https - https://phabricator.wikimedia.org/T300816 (10Jelto) 05In progress→03Resolved `gitlab.wikimedia.org` and `gitlab-replica.wikimedia.org` can be reached over IPv6 now! ` $ curl -s -I -6 https://gitlab.wikimedia.org/explore | gr... [11:41:43] (03PS10) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 [11:42:15] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/33644/graphite2003.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:42:28] (03CR) 10MMandere: [C: 03+1] "LGTM, should we have these alerts moved to operations/alerts repo?" 
[puppet] - 10https://gerrit.wikimedia.org/r/760615 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [11:48:39] (03PS2) 10Zabe: graphite: whisper_cleanup: remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/751471 (https://phabricator.wikimedia.org/T273673) [11:52:44] (03PS2) 10Zabe: graphite: remove absented crons [puppet] - 10https://gerrit.wikimedia.org/r/751208 (https://phabricator.wikimedia.org/T273673) [11:58:10] (03CR) 10MMandere: [C: 03+1] "LGTM! Checked access Switch DNS Names match with what we have in NetBox." [puppet] - 10https://gerrit.wikimedia.org/r/760616 (https://phabricator.wikimedia.org/T282788) (owner: 10BBlack) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T1200). [12:00:05] zabe: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:12] o/ [12:00:28] o/ [12:01:44] alright, I can deploy [12:02:42] (03PS2) 10Lucas Werkmeister (WMDE): Stop writing to $wmfRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760640 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [12:05:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop writing to $wmfRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760640 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [12:06:16] (03Merged) 10jenkins-bot: Stop writing to $wmfRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760640 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [12:06:48] zabe: the patch is on mwdebug1001, please test it there [12:07:13] (03PS3) 10Volans: k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [12:09:10] Lucas_WMDE: nothing seems to break and I can't see anything in logstash, so I think we are good to go [12:09:17] alright [12:09:25] I did a grep -rF wmfRealm and found no remaining matches either [12:09:32] (except a git packfile ^^) [12:09:52] do these need to be synced in a particular order? 
[12:10:02] no [12:10:22] (03PS1) 10Jbond: P:puppetbard: add new saml based vhost for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/761307 [12:10:25] alright, syncing tests first [12:10:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:29] (03PS1) 10Jbond: puppetboard-samltest: create new host for testing [dns] - 10https://gerrit.wikimedia.org/r/761308 [12:11:47] (03CR) 10jerkins-bot: [V: 04-1] P:puppetbard: add new saml based vhost for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/761307 (owner: 10Jbond) [12:11:49] (03PS1) 10Arturo Borrero Gonzalez: toolforge: grid: always install latest jobutils package [puppet] - 10https://gerrit.wikimedia.org/r/761309 [12:11:57] !log lucaswerkmeister-wmde@deploy1002 Synchronized tests/loggingTest.php: Config: [[gerrit:760640|Stop writing to $wmfRealm (T45956)]] (1/3) (duration: 01m 38s) [12:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:02] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [12:12:24] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10MoritzMuehlenhoff) >>! In T300985#7694737, @jhathaway wrote: > I was able to confirm that the problem is due to https://salsa.debian.org/apt-team/apt/-/commit/fa375493c5a4ed9c10... 
[12:13:07] !log lucaswerkmeister-wmde@deploy1002 Synchronized multiversion/buildConfigCache.php: Config: [[gerrit:760640|Stop writing to $wmfRealm (T45956)]] (2/3) (duration: 00m 49s) [12:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:20] !log lucaswerkmeister-wmde@deploy1002 Synchronized multiversion/MWRealm.php: Config: [[gerrit:760640|Stop writing to $wmfRealm (T45956)]] (3/3) (duration: 00m 49s) [12:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:36] jouncebot: now [12:14:37] For the next 0 hour(s) and 45 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T1200) [12:15:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:15:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:04] nn1l2: do you want to add something to the window? [12:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:18] I'm preparing a patch [12:15:25] It may take 5 mins [12:16:04] Lucas_WMDE: thanks for your help :) [12:16:08] np :) [12:16:12] thanks for cleaning up the config! 
^^ [12:16:46] (03CR) 10Majavah: k8s: add module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [12:17:08] (03PS2) 10Arturo Borrero Gonzalez: toolforge: grid: always install latest jobutils package [puppet] - 10https://gerrit.wikimedia.org/r/761309 [12:17:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-tests: make sure no webservices are left started [puppet] - 10https://gerrit.wikimedia.org/r/761304 (owner: 10Arturo Borrero Gonzalez) [12:17:35] (03PS11) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 [12:19:20] (03PS2) 10Jbond: P:puppetbard: add new saml based vhost for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/761307 [12:19:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 (owner: 10Arturo Borrero Gonzalez) [12:19:41] (03CR) 10Volans: "The general structure seems sane to me, with the accessor of a kubernetes object that can return pods and nodes to act on. 
I would like to" [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [12:19:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: grid: always install latest jobutils package [puppet] - 10https://gerrit.wikimedia.org/r/761309 (owner: 10Arturo Borrero Gonzalez) [12:21:10] (03PS1) 104nn1l2: sawikisource: Add audio book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761310 (https://phabricator.wikimedia.org/T282970) [12:21:30] (03PS3) 10Jbond: P:puppetbard: add new saml based vhost for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/761307 [12:21:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:41] B&C still open? [12:25:05] yes [12:25:29] (03PS4) 10Jbond: P:puppetbard: add new saml based vhost for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/761307 [12:25:53] (03PS1) 10Marostegui: Revert "db2076,db2095: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/761096 [12:26:23] I added a change to https://wikitech.wikimedia.org/wiki/Deployments#Wednesday,_February_9 [12:27:06] (03PS5) 10Jbond: P:puppetbard: add new saml based vhost for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/761307 [12:27:12] (03CR) 10Marostegui: [C: 03+2] Revert "db2076,db2095: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/761096 (owner: 10Marostegui) [12:27:47] alright *looks* [12:27:48] (03CR) 10jerkins-bot: [V: 04-1] P:puppetbard: add new saml based vhost for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/761307 (owner: 10Jbond) [12:27:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33649/console" [puppet] - 10https://gerrit.wikimedia.org/r/761307 (owner: 10Jbond) [12:28:48] (03PS1) 
10Majavah: openstack pdns_auth: fix prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/761313 (https://phabricator.wikimedia.org/T300254) [12:29:25] (03CR) 10jerkins-bot: [V: 04-1] openstack pdns_auth: fix prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/761313 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [12:30:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "Namespace 104 is already taken according to https://sa.wikisource.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&formatversion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761310 (https://phabricator.wikimedia.org/T282970) (owner: 104nn1l2) [12:31:47] (03PS2) 10Majavah: openstack pdns_auth: fix prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/761313 (https://phabricator.wikimedia.org/T300254) [12:32:18] (03CR) 10Jbond: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/761307 (owner: 10Jbond) [12:32:41] (03PS2) 104nn1l2: sawikisource: Add audio book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761310 (https://phabricator.wikimedia.org/T282970) [12:33:44] (03CR) 104nn1l2: sawikisource: Add audio book namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761310 (https://phabricator.wikimedia.org/T282970) (owner: 104nn1l2) [12:34:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] sawikisource: Add audio book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761310 (https://phabricator.wikimedia.org/T282970) (owner: 104nn1l2) [12:35:22] (03PS6) 10Jbond: P:puppetbard: add new saml based vhost for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/761307 [12:37:45] (03CR) 10Majavah: [V: 03+1] "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33651" [puppet] - 10https://gerrit.wikimedia.org/r/761313 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [12:37:58] (03Merged) 10jenkins-bot: sawikisource: Add audio 
book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761310 (https://phabricator.wikimedia.org/T282970) (owner: 104nn1l2) [12:38:45] nn1l2: the change is on mwdebug1001, please test [12:39:06] LGTM [12:39:44] ok [12:40:10] syncing [12:40:53] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:761310|sawikisource: Add audio book namespace (T282970)]] (duration: 00m 50s) [12:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:58] T282970: Create AUDIO BOOK namespace in Sanskrit Wikisource - https://phabricator.wikimedia.org/T282970 [12:41:15] !log UTC morning backport+config window done [12:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:33] I’ll keep an eye on logstash, but no more patches this window, I need to eat before my next meeting :D [12:41:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:42:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:44] Thanks, Lucas [12:44:57] np :) [12:46:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:16] (03PS1) 10Majavah: dnsrecursor: add built-in webserver support [puppet] - 10https://gerrit.wikimedia.org/r/761315 (https://phabricator.wikimedia.org/T300254) [12:51:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "+1 for the k8s API calls. 
They seem reasonable to me, and pretty similar to what we would use in related operations in our k8s clusters." [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [12:52:36] (03PS2) 10Majavah: dnsrecursor: add built-in webserver support [puppet] - 10https://gerrit.wikimedia.org/r/761315 (https://phabricator.wikimedia.org/T300254) [12:53:14] (03CR) 10jerkins-bot: [V: 04-1] dnsrecursor: add built-in webserver support [puppet] - 10https://gerrit.wikimedia.org/r/761315 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [12:54:20] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/761315 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [12:56:10] (03PS9) 10Filippo Giunchedi: graphite: whisper_cleanup: migrate cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:56:20] (03PS1) 10Reedy: Pin league/oauth2-server to hash [vendor] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761316 [12:56:22] (03PS1) 10Reedy: Downgrading symfony/console (v5.4.3 => v5.4.2) [vendor] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761317 (https://phabricator.wikimedia.org/T301320) [12:57:08] (03PS10) 10Filippo Giunchedi: graphite: whisper_cleanup: migrate cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:58:16] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated-tests: prefix cron job with jsub [puppet] - 10https://gerrit.wikimedia.org/r/761318 [12:58:43] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you Zabe, I've made a couple of minor changes and I'll merge this" [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:58:47] (03CR) 10Volans: "reply to question inline" [software/spicerack] - 
10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [13:00:00] (03CR) 10Reedy: [C: 03+2] Pin league/oauth2-server to hash [vendor] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761316 (owner: 10Reedy) [13:00:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-tests: prefix cron job with jsub [puppet] - 10https://gerrit.wikimedia.org/r/761318 (owner: 10Arturo Borrero Gonzalez) [13:00:45] (03CR) 10Reedy: [C: 03+2] Downgrading symfony/console (v5.4.3 => v5.4.2) [vendor] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761317 (https://phabricator.wikimedia.org/T301320) (owner: 10Reedy) [13:03:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "This LGTM, remember to clean up:" [puppet] - 10https://gerrit.wikimedia.org/r/761288 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [13:03:21] (03PS3) 10Zabe: graphite: whisper_cleanup: remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/751471 (https://phabricator.wikimedia.org/T273673) [13:05:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] k8s: add module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [13:05:42] (03PS3) 10Jbond: dnsrecursor: add built-in webserver support [puppet] - 10https://gerrit.wikimedia.org/r/761315 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [13:06:18] (03CR) 10jerkins-bot: [V: 04-1] dnsrecursor: add built-in webserver support [puppet] - 10https://gerrit.wikimedia.org/r/761315 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [13:07:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33654/console" [puppet] - 10https://gerrit.wikimedia.org/r/761315 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [13:10:07] (03CR) 10Muehlenhoff: "Looks good, one nit inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/761307 (owner: 10Jbond) [13:10:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/761308 (owner: 10Jbond) [13:11:16] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/742686 (owner: 10Muehlenhoff) [13:13:30] (03Merged) 10jenkins-bot: Pin league/oauth2-server to hash [vendor] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761316 (owner: 10Reedy) [13:13:59] (03Merged) 10jenkins-bot: Downgrading symfony/console (v5.4.3 => v5.4.2) [vendor] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761317 (https://phabricator.wikimedia.org/T301320) (owner: 10Reedy) [13:14:08] jouncebot: nowandnext [13:14:08] No deployments scheduled for the next 5 hour(s) and 45 minute(s) [13:14:08] In 5 hour(s) and 45 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T1900) [13:14:08] In 5 hour(s) and 45 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T1900) [13:17:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:25] (03CR) 10Filippo Giunchedi: [C: 03+1] Add Cumin aliases for edge sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742686 (owner: 10Muehlenhoff) [13:18:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:18:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:19:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [13:19:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [13:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298554)', diff saved to https://phabricator.wikimedia.org/P20412 and previous config saved to /var/cache/conftool/dbconfig/20220209-131938-ladsgroup.json [13:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:43] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [13:26:25] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:27:48] (03PS4) 10Jbond: dnsrecursor: add built-in webserver support [puppet] - 10https://gerrit.wikimedia.org/r/761315 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [13:27:50] (03PS1) 10Jbond: rake_modules: remove puppet_lint_fix from tasklist [puppet] - 10https://gerrit.wikimedia.org/r/761322 [13:28:51] (03PS1) 10Muehlenhoff: puppetboard: Set cookie_secure to On: instead of Auto: [puppet] - 10https://gerrit.wikimedia.org/r/761324 [13:29:28] (03CR) 10Jbond: [C: 03+2] rake_modules: remove puppet_lint_fix from tasklist [puppet] - 10https://gerrit.wikimedia.org/r/761322 (owner: 10Jbond) [13:32:10] (03PS1) 10Marostegui: change_ir_value_T300382.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/761325 (https://phabricator.wikimedia.org/T300382) [13:34:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/761324 (owner: 10Muehlenhoff) [13:36:37] !log reedy@deploy1002 Started scap: Downgrading symfony/console \(v5.4.3 => v5.4.2\) T301320 [13:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:42] T301320: shell.php commands broken - https://phabricator.wikimedia.org/T301320 [13:38:11] (03CR) 10Ladsgroup: [C: 03+1] change_ir_value_T300382.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/761325 (https://phabricator.wikimedia.org/T300382) (owner: 10Marostegui) [13:38:12] !log reedy@deploy1002 Finished scap: Downgrading symfony/console \(v5.4.3 => v5.4.2\) T301320 (duration: 01m 34s) [13:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:43] (03PS2) 10Ladsgroup: admin: Completely remove sc-admins group [puppet] - 10https://gerrit.wikimedia.org/r/761001 [13:38:48] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Completely remove sc-admins group [puppet] - 10https://gerrit.wikimedia.org/r/761001 (owner: 10Ladsgroup) [13:38:50] (03CR) 10Marostegui: [V: 03+2 C: 03+2] change_ir_value_T300382.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/761325 (https://phabricator.wikimedia.org/T300382) (owner: 10Marostegui) [13:39:21] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10jnuche) [13:42:45] (03PS1) 10Reedy: Pin league/oauth2-server to hash [vendor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761329 [13:42:47] (03PS1) 10Reedy: Downgrading symfony/console (v5.4.3 => v5.4.2) [vendor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761330 (https://phabricator.wikimedia.org/T301320) [13:43:00] (03CR) 10Reedy: [C: 03+2] Pin league/oauth2-server to hash [vendor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761329 (owner: 10Reedy) [13:43:05] (03CR) 10Reedy: [C: 03+2] Downgrading symfony/console (v5.4.3 => 
v5.4.2) [vendor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761330 (https://phabricator.wikimedia.org/T301320) (owner: 10Reedy) [13:44:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/761324 (owner: 10Muehlenhoff) [13:47:20] (03CR) 10JMeybohm: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/761279 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:48:04] !log update scap to 4.3.1 on all hosts - T301307 [13:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:09] T301307: Deploy Scap version 4.3.1 - https://phabricator.wikimedia.org/T301307 [13:50:00] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:52:53] (03CR) 10Ema: [C: 03+2] prometheus::ops: remove atskafka job definition [puppet] - 10https://gerrit.wikimedia.org/r/761288 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [13:53:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Migrate to bullseye (T300510) [13:53:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Migrate to bullseye (T300510) [13:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:01] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [13:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [13:55:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: 
Maintenance [13:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T300510)', diff saved to https://phabricator.wikimedia.org/P20414 and previous config saved to /var/cache/conftool/dbconfig/20220209-135515-ladsgroup.json [13:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2126.codfw.wmnet with OS bullseye [13:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:54] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin aliases for edge sites [puppet] - 10https://gerrit.wikimedia.org/r/742686 (owner: 10Muehlenhoff) [13:58:56] (03CR) 10jerkins-bot: [V: 04-1] Pin league/oauth2-server to hash [vendor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761329 (owner: 10Reedy) [13:58:58] (03CR) 10jerkins-bot: [V: 04-1] Downgrading symfony/console (v5.4.3 => v5.4.2) [vendor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761330 (https://phabricator.wikimedia.org/T301320) (owner: 10Reedy) [14:00:50] (03CR) 10Reedy: [C: 03+2] Pin league/oauth2-server to hash [vendor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761329 (owner: 10Reedy) [14:00:58] (03CR) 10Reedy: [C: 03+2] Downgrading symfony/console (v5.4.3 => v5.4.2) [vendor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761330 (https://phabricator.wikimedia.org/T301320) (owner: 10Reedy) [14:03:38] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Switch Logstash/apifeatureusage to use the system OpenJDK 11 - https://phabricator.wikimedia.org/T300853 (10MoritzMuehlenhoff) 05Resolved→03Open Is another restart of the logstash service needed? On apifeatureusage1001 logstash still uses t... [14:07:42] (03CR) 10MMandere: [C: 03+1] "LGTM! 
Verified the PHID, matches what we have under ops-drmrs project in phabricator." [puppet] - 10https://gerrit.wikimedia.org/r/760614 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [14:09:02] (03PS1) 10Majavah: P:toolforge: remove clush access [puppet] - 10https://gerrit.wikimedia.org/r/761337 (https://phabricator.wikimedia.org/T298191) [14:10:22] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33655/console" [puppet] - 10https://gerrit.wikimedia.org/r/761337 (https://phabricator.wikimedia.org/T298191) (owner: 10Majavah) [14:10:47] (03CR) 10Majavah: P:toolforge: remove clush access [puppet] - 10https://gerrit.wikimedia.org/r/761337 (https://phabricator.wikimedia.org/T298191) (owner: 10Majavah) [14:15:42] 10Puppet, 10Infrastructure-Foundations: update hiera order ii produciton environment - https://phabricator.wikimedia.org/T301349 (10jbond) p:05Triage→03Medium [14:15:58] (03PS1) 10Jbond: P:base::production: update hiera preference [puppet] - 10https://gerrit.wikimedia.org/r/761339 [14:16:00] (03PS1) 10Jbond: P:base::production: update hiera preference [puppet] - 10https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) [14:16:29] (03PS2) 10Jbond: P:base::production: update hiera preference [puppet] - 10https://gerrit.wikimedia.org/r/761339 (https://phabricator.wikimedia.org/T301349) [14:17:48] (03Merged) 10jenkins-bot: Pin league/oauth2-server to hash [vendor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761329 (owner: 10Reedy) [14:17:51] (03Merged) 10jenkins-bot: Downgrading symfony/console (v5.4.3 => v5.4.2) [vendor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761330 (https://phabricator.wikimedia.org/T301320) (owner: 10Reedy) [14:18:22] (03CR) 10Jbond: [C: 03+2] P:puppetbard: add new saml based vhost for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/761307 (owner: 10Jbond) [14:18:47] (03PS7) 10Jbond: P:puppetbard: 
add new saml based vhost for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/761307 [14:19:01] (03CR) 10Jbond: [C: 03+2] puppetboard-samltest: create new host for testing [dns] - 10https://gerrit.wikimedia.org/r/761308 (owner: 10Jbond) [14:20:45] !log reedy@deploy1002 Started scap: Downgrading symfony/console (v5.4.3 => v5.4.2) T301320 [14:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:51] T301320: shell.php commands broken - https://phabricator.wikimedia.org/T301320 [14:22:17] !log reedy@deploy1002 Finished scap: Downgrading symfony/console (v5.4.3 => v5.4.2) T301320 (duration: 01m 31s) [14:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:36] (03CR) 10Jbond: [C: 03+2] puppetboard: Set cookie_secure to On: instead of Auto: [puppet] - 10https://gerrit.wikimedia.org/r/761324 (owner: 10Muehlenhoff) [14:22:40] (03CR) 10Jbond: [C: 03+2] P:puppetbard: add new saml based vhost for puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/761307 (owner: 10Jbond) [14:22:54] (03PS2) 10Jbond: puppetboard: Set cookie_secure to On: instead of Auto: [puppet] - 10https://gerrit.wikimedia.org/r/761324 (owner: 10Muehlenhoff) [14:25:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:25:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:36] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:29:36] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2126.codfw.wmnet with OS bullseye [14:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:15] (03PS2) 10Jbond: P:base::production: update hiera preference [puppet] - 10https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) [14:31:38] (03PS3) 10Jbond: P:base::production: update hiera preference role/site vs site [puppet] - 10https://gerrit.wikimedia.org/r/761339 (https://phabricator.wikimedia.org/T301349) [14:31:47] (03PS3) 10Jbond: P:base::production: update hiera preference public vs private [puppet] - 10https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) [14:32:04] (03PS4) 10Jbond: P:base::production: update hiera preference public vs private [puppet] - 10https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) [14:35:20] (03PS1) 10Elukey: Add wmf-specific patches to Rsyslog [debs/rsyslog] (debian/bullseye-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/761345 (https://phabricator.wikimedia.org/T300744) [14:35:24] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:36:35] (03PS2) 10Elukey: Add wmf-specific patches to Rsyslog [debs/rsyslog] (debian/bullseye-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/761345 (https://phabricator.wikimedia.org/T300744) [14:36:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2126 (T300510)', diff saved to https://phabricator.wikimedia.org/P20415 and previous config saved to /var/cache/conftool/dbconfig/20220209-143642-ladsgroup.json [14:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:47] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [14:37:28] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:39:32] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:40:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298554)', diff saved to https://phabricator.wikimedia.org/P20416 and previous config saved to /var/cache/conftool/dbconfig/20220209-144008-ladsgroup.json [14:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:13] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [14:41:59] (03PS1) 10Marostegui: add_tl_target_id_T300775.py: Change downtime duration [software/schema-changes] - 10https://gerrit.wikimedia.org/r/761349 (https://phabricator.wikimedia.org/T300775) [14:43:13] !log prometheus: remove atskafka target files - '/srv/prometheus/ops/targets/atskafka_*' T247497 [14:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:17] T247497: Test atskafka deployment - https://phabricator.wikimedia.org/T247497 [14:44:21] (03CR) 10Marostegui: [V: 03+2 C: 03+2] add_tl_target_id_T300775.py: Change downtime duration 
[software/schema-changes] - 10https://gerrit.wikimedia.org/r/761349 (https://phabricator.wikimedia.org/T300775) (owner: 10Marostegui) [14:47:09] (03PS3) 10Elukey: Add wmf-specific patches to Rsyslog [debs/rsyslog] (debian/bullseye-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/761345 (https://phabricator.wikimedia.org/T300744) [14:47:48] (03PS1) 10Jbond: idp: fix validate url [puppet] - 10https://gerrit.wikimedia.org/r/761350 [14:47:52] (03CR) 10Elukey: [C: 04-1] "too many files added sigh" [debs/rsyslog] (debian/bullseye-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/761345 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:48:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: fix validate url [puppet] - 10https://gerrit.wikimedia.org/r/761350 (owner: 10Jbond) [14:50:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 306): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33656/console" [puppet] - 10https://gerrit.wikimedia.org/r/761339 (https://phabricator.wikimedia.org/T301349) (owner: 10Jbond) [14:51:16] (03Abandoned) 10Elukey: Add wmf-specific patches to Rsyslog [debs/rsyslog] (debian/bullseye-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/761345 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:53:29] (03Restored) 10Elukey: Add wmf-specific patches to Rsyslog [debs/rsyslog] (debian/bullseye-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/761345 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:55:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P20418 and previous config saved to /var/cache/conftool/dbconfig/20220209-145513-ladsgroup.json [14:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:54] (03CR) 10Vgutierrez: [C: 03+1] Add LVS service k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) 
(owner: 10JMeybohm) [14:57:49] (03PS1) 10Majavah: hieradata: Remove data for non-existent roles [puppet] - 10https://gerrit.wikimedia.org/r/761355 (https://phabricator.wikimedia.org/T296533) [14:59:04] (03PS2) 10Btullis: Change the date at which the Movement Metrics tasks run [puppet] - 10https://gerrit.wikimedia.org/r/757679 (https://phabricator.wikimedia.org/T295733) [15:00:19] (03CR) 10JMeybohm: [C: 03+2] Add LVS service k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [15:05:51] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10thcipriani) >>! In T301241#7696250, @LSobanski wrote: > Is gitlab-roots actually needed? Maybe the group does something else than the name would suggest but I thought root acces... [15:06:31] !log imported jenkins 2.319.3 to thirdparty/ci T301361 [15:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:21] (03PS1) 10JMeybohm: Move k8s-ingress-staging to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/761357 (https://phabricator.wikimedia.org/T300740) [15:10:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P20419 and previous config saved to /var/cache/conftool/dbconfig/20220209-151017-ladsgroup.json [15:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:23] (03PS1) 10Urbanecm: Track changes of growthexperiments-mentor-away-timestamp [extensions/WikimediaEvents] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761098 (https://phabricator.wikimedia.org/T280307) [15:10:44] (03CR) 10Btullis: [C: 03+2] Change the date at which the Movement Metrics tasks run [puppet] - 10https://gerrit.wikimedia.org/r/757679 (https://phabricator.wikimedia.org/T295733) (owner: 10Btullis) [15:10:49] (03PS1) 
10Urbanecm: Track changes of growthexperiments-mentor-away-timestamp [extensions/WikimediaEvents] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761099 (https://phabricator.wikimedia.org/T280307) [15:17:02] (03PS2) 10Majavah: hieradata: remove cloud-cumin-01,02 [puppet] - 10https://gerrit.wikimedia.org/r/757026 (https://phabricator.wikimedia.org/T255980) [15:19:01] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:10] (03PS1) 10Ssingh: dnsdist: update AAAA records for check.wikimedia-dns.org [puppet] - 10https://gerrit.wikimedia.org/r/761362 (https://phabricator.wikimedia.org/T301165) [15:20:36] (03CR) 10JMeybohm: [C: 03+2] Move k8s-ingress-staging to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/761357 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [15:21:03] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:24:10] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33658/console" [puppet] - 10https://gerrit.wikimedia.org/r/761362 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [15:24:41] (03PS1) 10Ssingh: wikimedia-dns.org: add AAAA records for Wikidough [dns] - 10https://gerrit.wikimedia.org/r/761363 (https://phabricator.wikimedia.org/T301165) [15:25:08] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 67 connections established with conf2004.codfw.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal [15:25:21] (03PS1) 10Ssingh: Add 
Wikidough's IPv6 anycast network in esams [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) [15:25:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298554)', diff saved to https://phabricator.wikimedia.org/P20420 and previous config saved to /var/cache/conftool/dbconfig/20220209-152522-ladsgroup.json [15:25:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:25:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:29] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [15:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:32] pybal is me [15:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:13] (03CR) 10jerkins-bot: [V: 04-1] Add Wikidough's IPv6 anycast network in esams [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [15:26:14] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.69:30443]) https://wikitech.wikimedia.org/wiki/PyBal [15:26:54] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 71 connections established with conf1004.eqiad.wmnet:4001 (min=72) https://wikitech.wikimedia.org/wiki/PyBal [15:26:59] wait...what.. 
[15:27:01] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 87 connections established with conf2004.codfw.wmnet:4001 (min=88) https://wikitech.wikimedia.org/wiki/PyBal [15:28:01] that's expected [15:28:10] I see thanks [15:28:14] puppet ran on those two [15:28:19] and pybal hasn't been restarted yet :) [15:28:35] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Give bmansurov access necessary to support Research Airflow jobs - https://phabricator.wikimedia.org/T301215 (10Arnoldokoth) Hey @Ottomata Could you kindly confirm the appropriate group for this and also approve the same so we could go ahead with it? [15:28:37] you can stop sweating jayme ;P [15:28:48] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Give bmansurov access necessary to support Research Airflow jobs - https://phabricator.wikimedia.org/T301215 (10Arnoldokoth) 05Open→03In progress [15:28:48] I tend to just expect IPVS diff :D [15:29:04] (03PS2) 10Ssingh: Add Wikidough's IPv6 anycast network in esams [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) [15:29:25] and be always surprised by the etcd ones...(they are mentioned in the doc's as well, though) [15:29:39] (03CR) 10Ssingh: "Additional confirmation: https://netbox.wikimedia.org/ipam/prefixes/515/ip-addresses/" [dns] - 10https://gerrit.wikimedia.org/r/761363 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [15:29:43] (03PS1) 10Urbanecm: Mentor dashboard: Mark mentor-tools as beta [extensions/GrowthExperiments] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761100 (https://phabricator.wikimedia.org/T280307) [15:29:58] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 119 connections established with conf1004.eqiad.wmnet:4001 (min=120) https://wikitech.wikimedia.org/wiki/PyBal [15:30:07] (03PS1) 10Urbanecm: Mentor dashboard: Mark mentor-tools as beta [extensions/GrowthExperiments] (wmf/1.38.0-wmf.21) - 
10https://gerrit.wikimedia.org/r/761101 (https://phabricator.wikimedia.org/T280307) [15:30:30] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.69:30443]) https://wikitech.wikimedia.org/wiki/PyBal [15:30:40] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:30:41] (03CR) 10jerkins-bot: [V: 04-1] Add Wikidough's IPv6 anycast network in esams [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [15:30:59] !log restarting pybal on lvs2010,lvs1020 - T300740 [15:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:04] T300740: Provide a convenient way to connect to services in kubernetes staging clusters - https://phabricator.wikimedia.org/T300740 [15:31:04] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.69:30443]) https://wikitech.wikimedia.org/wiki/PyBal [15:31:22] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.69:30443]) https://wikitech.wikimedia.org/wiki/PyBal [15:32:16] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 88 connections established with conf2004.codfw.wmnet:4001 (min=88) https://wikitech.wikimedia.org/wiki/PyBal [15:34:24] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value [15:34:24] /protection = Missing keys: [edit, move] 
https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:34:36] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 120 connections established with conf1004.eqiad.wmnet:4001 (min=120) https://wikitech.wikimedia.org/wiki/PyBal [15:37:01] vgutierrez: AIUI, the new service does not have any backends. Would you mind to confirm? [15:37:36] 10.2.1.69 / 10.2.2.69 [15:39:56] that's expected [15:40:08] https://www.irccloud.com/pastebin/FC5n9P21/ [15:40:30] you need to administratively pool them with confctl [15:40:53] (and don't forget to give them non-zero weights while you're at it!) [15:41:16] indeed [15:41:46] ah, I see. This is because it's the first entry using cluster=kubernetes-staging,service=kubesvc I guess [15:42:07] thanks! [15:42:21] indeed [15:43:10] (03CR) 10Jbond: [C: 03+2] "LGTM, could you also create a CR to add https://phabricator.wikimedia.org/P17881 to utils" [puppet] - 10https://gerrit.wikimedia.org/r/761355 (https://phabricator.wikimedia.org/T296533) (owner: 10Majavah) [15:43:31] !log jayme@cumin1001 conftool action : set/pooled=yes:weight=10; selector: cluster=kubernetes-staging,service=kubesvc [15:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:34] PROBLEM - Confd template for /srv/config-master/pybal/codfw/k8s-ingress-staging on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/k8s-ingress-staging https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:44:16] !log change puppet hiera prefernce site vs site/role gerrit:761339 [15:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:20] (03CR) 10Jbond: [C: 03+2] P:base::production: update hiera preference role/site vs site [puppet] - 10https://gerrit.wikimedia.org/r/761339 (https://phabricator.wikimedia.org/T301349) (owner: 10Jbond) [15:44:41] https://www.irccloud.com/pastebin/sIreK0gz/ [15:44:44] there you go :) [15:44:51] (03PS1) 10Urbanecm: Mentor dashboard: Move icons from 
the top edge [extensions/GrowthExperiments] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761102 [15:45:09] yep, looks ipvsadm looks better now :) [15:45:34] (03PS1) 10Ottomata: airflow - add deployment dir to PYTHONPATH [puppet] - 10https://gerrit.wikimedia.org/r/761371 [15:45:45] TCP 10.2.2.69:30443 wrr [15:45:45] -> 10.64.16.55:30443 Route 10 0 0 [15:45:45] -> 10.64.48.106:30443 Route 10 0 0 [15:45:51] (03PS3) 10Ssingh: Add Wikidough's IPv6 anycast network in esams [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) [15:45:51] neat :) [15:46:05] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Give bmansurov access necessary to support Research Airflow jobs - https://phabricator.wikimedia.org/T301215 (10Ottomata) Hello! Yes, the group is analytics-research-admins, and approved! [15:46:17] (03CR) 10jerkins-bot: [V: 04-1] airflow - add deployment dir to PYTHONPATH [puppet] - 10https://gerrit.wikimedia.org/r/761371 (owner: 10Ottomata) [15:46:32] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:46:32] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:46:40] should I do something about the confd template alert or just wait for it to heal? [15:47:04] (03CR) 10Ssingh: "I made this match the schema in anycast_neighbors:schemas/definitions/device-generic.schema but I am not sure if this is actually desired." 
[homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [15:47:06] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33659/console" [puppet] - 10https://gerrit.wikimedia.org/r/761371 (owner: 10Ottomata) [15:47:22] (03PS2) 10Ottomata: airflow - add deployment dir to PYTHONPATH [puppet] - 10https://gerrit.wikimedia.org/r/761371 [15:48:21] (03PS1) 10Ssingh: bird: update vips_filter for Wikidough's IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/761372 (https://phabricator.wikimedia.org/T301165) [15:48:23] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33660/console" [puppet] - 10https://gerrit.wikimedia.org/r/761371 (owner: 10Ottomata) [15:49:23] (03PS1) 10Ssingh: hiera: add IPv6 support to Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/761373 (https://phabricator.wikimedia.org/T301165) [15:50:06] (03CR) 10jerkins-bot: [V: 04-1] hiera: add IPv6 support to Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/761373 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [15:50:56] (03PS2) 10Ssingh: hiera: add IPv6 support to Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/761373 (https://phabricator.wikimedia.org/T301165) [15:51:58] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33661/console" [puppet] - 10https://gerrit.wikimedia.org/r/761373 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [15:53:10] (03PS3) 10Ottomata: airflow - add deployment dir to PYTHONPATH [puppet] - 10https://gerrit.wikimedia.org/r/761371 [15:53:27] (03PS3) 10Jbond: hieradata: move prometheus_nodes to WMCS role-based hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) 
[15:54:19] (03PS1) 10Ottomata: airflow - absent analytics-test instance, we will rename to analytics_test [puppet] - 10https://gerrit.wikimedia.org/r/761374 [15:54:29] (03CR) 10Ottomata: [C: 03+2] airflow - add deployment dir to PYTHONPATH [puppet] - 10https://gerrit.wikimedia.org/r/761371 (owner: 10Ottomata) [15:54:31] (03CR) 10Ottomata: [V: 03+2 C: 03+2] airflow - add deployment dir to PYTHONPATH [puppet] - 10https://gerrit.wikimedia.org/r/761371 (owner: 10Ottomata) [15:54:39] (03CR) 10Ottomata: [V: 03+2 C: 03+2] airflow - absent analytics-test instance, we will rename to analytics_test [puppet] - 10https://gerrit.wikimedia.org/r/761374 (owner: 10Ottomata) [15:54:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33662/console" [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [15:55:35] RECOVERY - Confd template for /srv/config-master/pybal/codfw/k8s-ingress-staging on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:56:16] !log restarting pybal on lvs1015,lvs2009 - T300740 [15:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:20] T300740: Provide a convenient way to connect to services in kubernetes staging clusters - https://phabricator.wikimedia.org/T300740 [15:57:00] !log ran sudo rm /var/run/confd-template/.k8s-ingress-staging*.err on puppetmaster2001 - T300740 [15:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:03] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 68 connections established with conf2004.codfw.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal [15:58:11] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:58:31] RECOVERY - PyBal IPVS diff check on 
lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:58:41] (03PS5) 10Jbond: P:base::production: update hiera preference public vs private [puppet] - 10https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) [15:59:43] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 72 connections established with conf1004.eqiad.wmnet:4001 (min=72) https://wikitech.wikimedia.org/wiki/PyBal [16:04:49] (03PS1) 10Ottomata: airflow - Rename analytics-test to analytics_test [puppet] - 10https://gerrit.wikimedia.org/r/761376 [16:04:51] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10LSobanski) Thanks for the explanation. Approved. [16:06:35] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33664/console" [puppet] - 10https://gerrit.wikimedia.org/r/761376 (owner: 10Ottomata) [16:08:21] (03Abandoned) 10JMeybohm: Allow to configure a different port for ProxyFetch monitor [debs/pybal] - 10https://gerrit.wikimedia.org/r/759749 (https://phabricator.wikimedia.org/T301137) (owner: 10JMeybohm) [16:08:41] (03PS2) 10Ottomata: airflow - Rename analytics-test to analytics_test [puppet] - 10https://gerrit.wikimedia.org/r/761376 [16:09:45] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33665/console" [puppet] - 10https://gerrit.wikimedia.org/r/761376 (owner: 10Ottomata) [16:09:57] (03CR) 10Ottomata: [V: 03+1 C: 03+2] airflow - Rename analytics-test to analytics_test [puppet] - 10https://gerrit.wikimedia.org/r/761376 (owner: 10Ottomata) [16:10:22] (03Abandoned) 10Elukey: Add wmf-specific patches to Rsyslog [debs/rsyslog] (debian/bullseye-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/761345 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [16:11:32] 
(03PS1) 10JMeybohm: Move k8s-ingress-staging to state: monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/761380 (https://phabricator.wikimedia.org/T300740) [16:14:53] (03PS1) 10Cwhite: logstash: set JAVA_HOME to system java runtime [puppet] - 10https://gerrit.wikimedia.org/r/761384 (https://phabricator.wikimedia.org/T300853) [16:15:27] (03PS1) 10Elukey: Add wmf-specific patches to Rsyslog [debs/rsyslog] (debian/bullseye-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/761385 (https://phabricator.wikimedia.org/T300744) [16:16:13] !log otto@deploy1002 Started deploy [airflow-dags/analytics_test@ddd10b4]: (no justification provided) [16:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:34] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@ddd10b4]: (no justification provided) (duration: 00m 20s) [16:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:22] !log otto@deploy1002 Started deploy [airflow-dags/analytics_test@ddd10b4]: (no justification provided) [16:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:25] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@ddd10b4]: (no justification provided) (duration: 00m 03s) [16:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:50] (03CR) 10Cwhite: [C: 03+2] "PCC ok https://puppet-compiler.wmflabs.org/pcc-worker1003/33666/" [puppet] - 10https://gerrit.wikimedia.org/r/761384 (https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [16:19:57] (03PS1) 10JMeybohm: Add discovery record k8s-ingress-staging [dns] - 10https://gerrit.wikimedia.org/r/761387 (https://phabricator.wikimedia.org/T300740) [16:21:15] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-staging,name=eqiad [16:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:36] (03CR) 10JMeybohm: [C: 
03+2] Move k8s-ingress-staging to state: monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/761380 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [16:23:07] (03PS1) 10Muehlenhoff: Set cookie_secure: On for superset [puppet] - 10https://gerrit.wikimedia.org/r/761388 [16:23:09] (03PS1) 10Muehlenhoff: turnilo: Set cookie_secure to On [puppet] - 10https://gerrit.wikimedia.org/r/761389 [16:23:11] (03PS1) 10Muehlenhoff: profile::idp::client::httpd::site: Default cookie_secure to On [puppet] - 10https://gerrit.wikimedia.org/r/761390 [16:23:40] PROBLEM - Check systemd state on elastic1041 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:47] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Switch Logstash/apifeatureusage to use the system OpenJDK 11 - https://phabricator.wikimedia.org/T300853 (10colewhite) 05Open→03Resolved >>! In T300853#7696996, @MoritzMuehlenhoff wrote: > Is another restart of the logstash service needed?... [16:27:29] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Switch Logstash/apifeatureusage to use the system OpenJDK 11 - https://phabricator.wikimedia.org/T300853 (10MoritzMuehlenhoff) Looks good, thanks! 
[16:29:07] (03CR) 10Giuseppe Lavagetto: k8s: add module (0317 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [16:30:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:30:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298554)', diff saved to https://phabricator.wikimedia.org/P20422 and previous config saved to /var/cache/conftool/dbconfig/20220209-163102-ladsgroup.json [16:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:12] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [16:31:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 305 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33663/console" [puppet] - 10https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) (owner: 10Jbond) [16:31:55] (03CR) 10Giuseppe Lavagetto: k8s: add module (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [16:32:40] (03PS1) 10JMeybohm: Move k8s-ingress-staging to state: poduction [puppet] - 10https://gerrit.wikimedia.org/r/761392 (https://phabricator.wikimedia.org/T300740) [16:32:44] (03PS1) 10JMeybohm: Add k8s-ingress-staging to disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/761393 (https://phabricator.wikimedia.org/T300740) [16:33:01] (03CR) 10Razzi: "Adding Andrew Bogott as a 
reviewer, since we worked on this in the past." [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [16:36:28] (03PS4) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [16:42:39] (03PS4) 10Giuseppe Lavagetto: k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) [16:48:24] (03CR) 10Elukey: [C: 03+1] Add discovery record k8s-ingress-staging [dns] - 10https://gerrit.wikimedia.org/r/761387 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [16:48:48] (03CR) 10Elukey: [C: 03+1] Move k8s-ingress-staging to state: poduction [puppet] - 10https://gerrit.wikimedia.org/r/761392 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [16:48:57] (03PS1) 10Majavah: utils: add script to audit role hiera files [puppet] - 10https://gerrit.wikimedia.org/r/761397 (https://phabricator.wikimedia.org/T296533) [16:49:08] (03CR) 10Elukey: [C: 03+1] Add k8s-ingress-staging to disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/761393 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [16:50:56] (03CR) 10Ssingh: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33661/doh1002.wikimedia.org/index.html looks good. 
We have bird6.service, hc-vip-wikime" [puppet] - 10https://gerrit.wikimedia.org/r/761373 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [16:51:07] (03CR) 10JMeybohm: [C: 03+2] Move k8s-ingress-staging to state: poduction [puppet] - 10https://gerrit.wikimedia.org/r/761392 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [16:52:21] (03PS8) 10AOkoth: kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T284628) [16:52:46] (03PS9) 10AOkoth: kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T284628) [16:54:12] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM" [software/httpbb] - 10https://gerrit.wikimedia.org/r/755764 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [16:54:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] tox: Run mypy only in the source directory and exclude .eggs from flake8 [software/httpbb] - 10https://gerrit.wikimedia.org/r/755796 (owner: 10RLazarus) [16:55:12] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Give bmansurov access necessary to support Research Airflow jobs - https://phabricator.wikimedia.org/T301215 (10Arnoldokoth) Okay, great. Thank you! @Ottomata [16:55:55] (03CR) 10JMeybohm: [C: 03+1] kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T284628) (owner: 10AOkoth) [16:56:10] (03CR) 10AOkoth: [C: 03+2] kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T284628) (owner: 10AOkoth) [17:00:20] 10ops-eqiad, 10DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T300820 (10Cmjohnson) @BBlack no, I did not make any changes. Maybe it was an anomaly. Do you want to leave it for now and see if it happens again? 
[17:00:34] 10SRE, 10serviceops, 10GitLab (Infrastructure): gitlab: enable IPv6 for https - https://phabricator.wikimedia.org/T300816 (10Dzahn) confirmed working: ` nmap -6 gitlab.wikimedia.org -p 443 .. PORT STATE SERVICE 443/tcp open https ` and also A+ now on https://www.ssllabs.com/ssltest/analyze.html?d=git... [17:03:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] k8s: add module (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [17:04:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:base::production: update hiera preference public vs private [puppet] - 10https://gerrit.wikimedia.org/r/761340 (https://phabricator.wikimedia.org/T301349) (owner: 10Jbond) [17:05:01] (03CR) 10JMeybohm: [C: 03+2] Add k8s-ingress-staging to disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/761393 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [17:05:03] !log joal@deploy1002 Started deploy [analytics/refinery@55b229b]: Regular analytics weekly train [analytics/refinery@55b229b] [17:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:08] (03CR) 10Dzahn: [C: 03+2] Remove upstart/sysvinit file [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759525 (owner: 10Alexandros Kosiaris) [17:05:10] (03CR) 10JMeybohm: [C: 03+2] Add discovery record k8s-ingress-staging [dns] - 10https://gerrit.wikimedia.org/r/761387 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [17:06:19] (03CR) 10Dzahn: [C: 03+2] Bump requirements to match 1.8.14 upstream [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759526 (owner: 10Alexandros Kosiaris) [17:07:26] !log ran sudo rm /var/run/confd-template/.k8s-ingress-staging*.err on puppetmaster1001 - T300740 [17:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:30] T300740: Provide a convenient way to connect to services in kubernetes 
staging clusters - https://phabricator.wikimedia.org/T300740 [17:09:06] (03PS2) 10Herron: watchrat: route alerts to irc and noc@ [puppet] - 10https://gerrit.wikimedia.org/r/761064 (https://phabricator.wikimedia.org/T299147) [17:09:14] (03CR) 10Dzahn: [C: 03+2] Refresh local patches, drop X-Client-IP logging [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759527 (owner: 10Alexandros Kosiaris) [17:10:00] (03CR) 10Herron: watchrat: route alerts to irc and noc@ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761064 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [17:10:49] (03CR) 10Dzahn: [C: 03+2] Add execute permission to npm-cli.js [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759528 (owner: 10Alexandros Kosiaris) [17:11:08] (03CR) 10Dzahn: [C: 03+2] Bump changelog to 1.8.14 [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759529 (owner: 10Alexandros Kosiaris) [17:12:47] (03CR) 10Elukey: "I saw the code review passing by, I think it is a great start but things should be improved a little before being ready to merge." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [17:13:18] (03CR) 10Dzahn: [C: 03+2] Bump to 1.8.16 [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759530 (owner: 10Alexandros Kosiaris) [17:15:28] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: remove cloud-cumin-01,02 [puppet] - 10https://gerrit.wikimedia.org/r/757026 (https://phabricator.wikimedia.org/T255980) (owner: 10Majavah) [17:17:20] (03CR) 10JMeybohm: [C: 03+1] Add wmf-specific patches to Rsyslog [debs/rsyslog] (debian/bullseye-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/761385 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [17:18:12] (03CR) 10Andrew Bogott: [C: 03+2] openstack pdns_auth: fix prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/761313 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [17:18:19] (03PS3) 10Andrew Bogott: openstack pdns_auth: fix prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/761313 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [17:22:50] (03CR) 10Andrew Bogott: [C: 03+2] dnsrecursor: add built-in webserver support [puppet] - 10https://gerrit.wikimedia.org/r/761315 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [17:23:59] (03PS5) 10Andrew Bogott: dnsrecursor: add built-in webserver support [puppet] - 10https://gerrit.wikimedia.org/r/761315 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [17:25:07] (03CR) 10Elukey: "recheck" [debs/rsyslog] (debian/bullseye-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/761385 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [17:27:03] !log joal@deploy1002 Finished deploy [analytics/refinery@55b229b]: Regular analytics weekly train [analytics/refinery@55b229b] (duration: 22m 00s) [17:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:12] !log joal@deploy1002 Started deploy [analytics/refinery@55b229b] (thin): 
Regular analytics weekly train THIN [analytics/refinery@55b229b] [17:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:19] !log joal@deploy1002 Finished deploy [analytics/refinery@55b229b] (thin): Regular analytics weekly train THIN [analytics/refinery@55b229b] (duration: 00m 07s) [17:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:37] !log joal@deploy1002 Started deploy [analytics/refinery@55b229b] (hadoop-test): Regular analytics weekly train HADOOP-TEST [analytics/refinery@55b229b] [17:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:06] !log upload rsyslog 8.2102.0-2+deb11u1+wmf1 packages to bullseye-wikimedia component/rsyslog-k8s [17:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:03] (03PS1) 10Herron: watchrat: route donate.wm.o alerts to fr-ircmail [puppet] - 10https://gerrit.wikimedia.org/r/761403 (https://phabricator.wikimedia.org/T299147) [17:37:42] !log joal@deploy1002 Finished deploy [analytics/refinery@55b229b] (hadoop-test): Regular analytics weekly train HADOOP-TEST [analytics/refinery@55b229b] (duration: 07m 04s) [17:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:12] (03PS2) 10Herron: watchrat: route donate.wm.o alerts to fr-ircmail [puppet] - 10https://gerrit.wikimedia.org/r/761403 (https://phabricator.wikimedia.org/T299147) [17:42:02] PROBLEM - SSH on wtp1027.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:43:04] (03CR) 10Herron: [C: 03+1] alertmanager: inhibit warnings when a match critical alert is firing [puppet] - 10https://gerrit.wikimedia.org/r/761285 (owner: 10Filippo Giunchedi) [17:47:20] (03CR) 10Herron: [C: 03+1] prometheus: add 'tcp' probe type [puppet] - 10https://gerrit.wikimedia.org/r/761279 (https://phabricator.wikimedia.org/T291946) 
(owner: 10Filippo Giunchedi) [17:49:54] 10SRE, 10vm-requests: eqiad: 3 of VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10razzi) [17:50:00] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:52:32] PROBLEM - Check systemd state on elastic1042 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:51] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10jhathaway) >>! In T300985#7696575, @MoritzMuehlenhoff wrote: > Good catch! It seems a little mysterious though that this problem isn't more widely reported, given that nginx is... [17:52:57] 10SRE, 10vm-requests: eqiad: 3 of VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10razzi) a:05akosiaris→03razzi [17:53:26] 10SRE, 10vm-requests: eqiad: 3 of VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10razzi) [17:55:46] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Done by Feb 23🔥): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (10thcipriani) [17:56:33] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Done by Feb 23🔥): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (10thcipriani) [17:59:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298554)', diff saved to https://phabricator.wikimedia.org/P20423 and previous config saved to /var/cache/conftool/dbconfig/20220209-175909-ladsgroup.json [17:59:13] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:14] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [18:00:25] !log copy calico debs from buster-wikimedia's component/calico-future to bullseye-wikimedia component/calico317 [18:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:40] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Give bmansurov access necessary to support Research Airflow jobs - https://phabricator.wikimedia.org/T301215 (10Arnoldokoth) Hey @leila Could you kindly approve this? [18:04:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/761388 (owner: 10Muehlenhoff) [18:04:47] (03CR) 10Jbond: [C: 03+1] turnilo: Set cookie_secure to On [puppet] - 10https://gerrit.wikimedia.org/r/761389 (owner: 10Muehlenhoff) [18:05:05] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/761390 (owner: 10Muehlenhoff) [18:05:38] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Give bmansurov access necessary to support Research Airflow jobs - https://phabricator.wikimedia.org/T301215 (10leila) Approved. (and thank you for your support.) 
[18:05:53] (03PS6) 10Jbond: C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) [18:07:37] (03PS1) 10Andrew Bogott: nfs add_server: create service address with prefix rather than volume name [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761429 [18:10:23] (03CR) 10Elukey: [C: 03+2] Add wmf-specific patches to Rsyslog [debs/rsyslog] (debian/bullseye-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/761385 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [18:11:29] (03CR) 10jerkins-bot: [V: 04-1] C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [18:11:56] (03CR) 10RLazarus: [C: 03+2] tox: Run mypy only in the source directory and exclude .eggs from flake8 [software/httpbb] - 10https://gerrit.wikimedia.org/r/755796 (owner: 10RLazarus) [18:12:27] (03PS2) 10Andrew Bogott: nfs add_server: create service address with prefix rather than volume name [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761429 [18:12:55] 10SRE, 10vm-requests: eqiad: 3 VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10RhinosF1) [18:13:34] (03Merged) 10jenkins-bot: tox: Run mypy only in the source directory and exclude .eggs from flake8 [software/httpbb] - 10https://gerrit.wikimedia.org/r/755796 (owner: 10RLazarus) [18:13:38] 10SRE, 10vm-requests: eqiad: 3 VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10RhinosF1) [18:14:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P20424 and previous config saved to /var/cache/conftool/dbconfig/20220209-181413-ladsgroup.json [18:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:24] (03CR) 10RLazarus: [C: 03+2] Initial 
deb package [software/httpbb] - 10https://gerrit.wikimedia.org/r/755764 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [18:15:35] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10Volans) Today @jbond and I joined the office hours of #sre_observability and discussed a bit the plan for the above. We agreed to split this into 2 phas... [18:15:55] (03Merged) 10jenkins-bot: Initial deb package [software/httpbb] - 10https://gerrit.wikimedia.org/r/755764 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [18:16:51] (03PS1) 10Andrew Bogott: wmcs 'maps' project: use project-local NFS server [puppet] - 10https://gerrit.wikimedia.org/r/761430 (https://phabricator.wikimedia.org/T300694) [18:18:40] 10SRE, 10vm-requests: VM Request template (form 84) title doesn't make sense - https://phabricator.wikimedia.org/T301387 (10RhinosF1) [18:18:59] 10SRE, 10Phabricator, 10vm-requests: VM Request template (form 84) title doesn't make sense - https://phabricator.wikimedia.org/T301387 (10RhinosF1) p:05Low→03Triage [18:19:32] (03CR) 10Andrew Bogott: [C: 03+2] wmcs 'maps' project: use project-local NFS server [puppet] - 10https://gerrit.wikimedia.org/r/761430 (https://phabricator.wikimedia.org/T300694) (owner: 10Andrew Bogott) [18:28:11] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10jbond) > I was able to hit the same bug against these mirrors as well: just to clarify, I'm guessing you also tested against some apache mirrors and were unable to reproduce? 
[18:29:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P20425 and previous config saved to /var/cache/conftool/dbconfig/20220209-182918-ladsgroup.json [18:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:02] (03PS1) 10Ppchelko: Add PHP array default settings loader benchmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761433 (https://phabricator.wikimedia.org/T300129) [18:30:28] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Patch-For-Review: role::puppetmaster::puppetdb uses nginx as reverse proxy and cannot be used together with Apache applications - https://phabricator.wikimedia.org/T154105 (10Majavah) [18:33:17] (03PS1) 10Herron: add new prometheus hosts to labs-in[4,6] [homer/public] - 10https://gerrit.wikimedia.org/r/761435 (https://phabricator.wikimedia.org/T301376) [18:35:26] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10fkaelin) This would affect the research team, especially if the stat machines are also included in this restriction. For e... 
[18:37:29] (03PS1) 10Majavah: ssh::client: optionally disable key puppetdb management [puppet] - 10https://gerrit.wikimedia.org/r/761437 (https://phabricator.wikimedia.org/T214427) [18:39:31] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33667/console" [puppet] - 10https://gerrit.wikimedia.org/r/761437 (https://phabricator.wikimedia.org/T214427) (owner: 10Majavah) [18:42:34] (03PS1) 10Andrew Bogott: cloudnfs: Add a hiera key to switch scratch hosting on or off [puppet] - 10https://gerrit.wikimedia.org/r/761438 (https://phabricator.wikimedia.org/T291405) [18:43:12] RECOVERY - SSH on wtp1027.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:43:18] (03CR) 10jerkins-bot: [V: 04-1] cloudnfs: Add a hiera key to switch scratch hosting on or off [puppet] - 10https://gerrit.wikimedia.org/r/761438 (https://phabricator.wikimedia.org/T291405) (owner: 10Andrew Bogott) [18:44:07] (03CR) 10Jbond: [C: 03+1] "LGTM, minor optional nit. 
also i like using protocol number for the ip address so would vote for 2001:67c:930::53 if you are still thinkin" [puppet] - 10https://gerrit.wikimedia.org/r/761372 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [18:44:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298554)', diff saved to https://phabricator.wikimedia.org/P20426 and previous config saved to /var/cache/conftool/dbconfig/20220209-184423-ladsgroup.json [18:44:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [18:44:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [18:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:29] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [18:44:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298554)', diff saved to https://phabricator.wikimedia.org/P20427 and previous config saved to /var/cache/conftool/dbconfig/20220209-184430-ladsgroup.json [18:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:36] (03PS2) 10Andrew Bogott: cloudnfs: Add a hiera key to switch scratch hosting on or off [puppet] - 10https://gerrit.wikimedia.org/r/761438 (https://phabricator.wikimedia.org/T291405) [18:46:22] (03CR) 10jerkins-bot: [V: 04-1] cloudnfs: Add a hiera key to switch scratch hosting on or off [puppet] - 10https://gerrit.wikimedia.org/r/761438 (https://phabricator.wikimedia.org/T291405) (owner: 10Andrew Bogott) [18:47:08] (03PS3) 10Andrew Bogott: cloudnfs: Add a hiera key to switch scratch hosting on or off [puppet] - 
10https://gerrit.wikimedia.org/r/761438 (https://phabricator.wikimedia.org/T291405) [18:47:37] (03PS1) 10Arturo Borrero Gonzalez: wmcs: openstack: introduce function server_list_filter_exists() [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761439 [18:47:39] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: pool: introduce support for nodeset-syntax host selection [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761440 [18:47:44] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:48:50] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:50:12] (03PS4) 10Andrew Bogott: cloudnfs: Add a hiera key to switch scratch hosting on or off [puppet] - 10https://gerrit.wikimedia.org/r/761438 (https://phabricator.wikimedia.org/T291405) [18:52:34] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: pool: introduce support for nodeset-syntax host selection [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761440 (https://phabricator.wikimedia.org/T298948) [18:53:04] (03PS5) 10Andrew Bogott: cloudnfs: Add a hiera key to switch scratch hosting on or off [puppet] - 10https://gerrit.wikimedia.org/r/761438 (https://phabricator.wikimedia.org/T291405) [18:55:04] (03PS3) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: pool: introduce support for nodeset-syntax host selection [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761440 (https://phabricator.wikimedia.org/T298948) [18:55:13] (03CR) 10Andrew Bogott: [C: 03+2] cloudnfs: Add a hiera key to switch scratch hosting on or off [puppet] - 
10https://gerrit.wikimedia.org/r/761438 (https://phabricator.wikimedia.org/T291405) (owner: 10Andrew Bogott) [18:55:23] (03PS4) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: pool: introduce support for nodeset-syntax host selection [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761440 (https://phabricator.wikimedia.org/T298948) [18:57:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: openstack: introduce function server_list_filter_exists() [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761439 (owner: 10Arturo Borrero Gonzalez) [18:58:40] (03PS1) 10Zabe: Migrate $wmfStandardAutoPromote to $wmgStandardAutoPromote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761441 (https://phabricator.wikimedia.org/T45956) [18:59:22] jouncebot: next [18:59:22] In 0 hour(s) and 0 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T1900) [18:59:22] In 0 hour(s) and 0 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T1900) [18:59:44] (03CR) 10Urbanecm: [C: 03+2] Mentor dashboard: Move icons from the top edge [extensions/GrowthExperiments] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761102 (owner: 10Urbanecm) [18:59:49] (03CR) 10Urbanecm: [C: 03+2] Track changes of growthexperiments-mentor-away-timestamp [extensions/WikimediaEvents] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761099 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [19:00:05] jeena and dancy: How many deployers does it take to do Train log triage with CPT deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T1900). [19:00:05] RoanKattouw and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T1900). 
[19:00:05] Urbanecm: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:15] (03CR) 10Urbanecm: [C: 03+2] Mentor dashboard: Mark mentor-tools as beta [extensions/GrowthExperiments] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761100 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [19:00:25] (03CR) 10Urbanecm: [C: 03+2] Track changes of growthexperiments-mentor-away-timestamp [extensions/WikimediaEvents] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761098 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [19:00:26] Jeena said she would be starting train about 30 minutes late today. [19:00:29] (03CR) 10Urbanecm: [C: 03+2] Mentor dashboard: Mark mentor-tools as beta [extensions/GrowthExperiments] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761101 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [19:00:47] dancy: I hope I will manage to finish in time though :) [19:01:14] Yes, fyi that alert was for train log triage and not train deploy [19:01:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: grid: pool: introduce support for nodeset-syntax host selection [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761440 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [19:01:52] oh right. 
[19:02:12] So I'll be ready in about an hour and a half [19:02:34] (03PS1) 10RLazarus: Add missing Build-Depends entry [software/httpbb] - 10https://gerrit.wikimedia.org/r/761442 [19:02:56] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:04:34] ACKNOWLEDGEMENT - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating andrew bogott this is probably a side-effect of T300694 somehow https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:04:41] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:46] (03CR) 10Jbond: [C: 04-1] "lgtm but can we get a default on the class as well" [puppet] - 10https://gerrit.wikimedia.org/r/761437 (https://phabricator.wikimedia.org/T214427) (owner: 10Majavah) [19:07:30] (03PS1) 10Zabe: filebackend: migrate $wmfSwift* to $wmgSwift* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761443 (https://phabricator.wikimedia.org/T45956) [19:07:52] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) [19:09:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/761397 (https://phabricator.wikimedia.org/T296533) (owner: 10Majavah) [19:19:15] (03PS2) 10Majavah: ssh::client: optionally disable key puppetdb management [puppet] - 10https://gerrit.wikimedia.org/r/761437 (https://phabricator.wikimedia.org/T214427) [19:19:43] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on cloudstore1008 is CRITICAL: 
CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:19:45] (03CR) 10Majavah: ssh::client: optionally disable key puppetdb management (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761437 (https://phabricator.wikimedia.org/T214427) (owner: 10Majavah) [19:21:41] (03Merged) 10jenkins-bot: Mentor dashboard: Move icons from the top edge [extensions/GrowthExperiments] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761102 (owner: 10Urbanecm) [19:21:43] (03Merged) 10jenkins-bot: Track changes of growthexperiments-mentor-away-timestamp [extensions/WikimediaEvents] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761099 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [19:21:46] (03Merged) 10jenkins-bot: Mentor dashboard: Mark mentor-tools as beta [extensions/GrowthExperiments] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761100 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [19:21:49] (03Merged) 10jenkins-bot: Track changes of growthexperiments-mentor-away-timestamp [extensions/WikimediaEvents] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761098 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [19:21:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33672/console" [puppet] - 10https://gerrit.wikimedia.org/r/761437 (https://phabricator.wikimedia.org/T214427) (owner: 10Majavah) [19:22:26] (03CR) 10Jbond: [V: 03+1 C: 03+2] ssh::client: optionally disable key puppetdb management [puppet] - 10https://gerrit.wikimedia.org/r/761437 (https://phabricator.wikimedia.org/T214427) (owner: 10Majavah) [19:23:11] (03PS1) 10Andrew Bogott: nfs-exportd: don't choke if the config doesn't define public volumes [puppet] - 10https://gerrit.wikimedia.org/r/761447 [19:23:41] !log [urbanecm@deploy1002 /srv/mediawiki-staging (master % u=)]$ rm v5.4.2\) 
# delete untracked file found in staging dir; created by Reedy, contains scap's logo [19:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:04] los [19:24:10] (03CR) 10Andrew Bogott: [C: 03+2] nfs-exportd: don't choke if the config doesn't define public volumes [puppet] - 10https://gerrit.wikimedia.org/r/761447 (owner: 10Andrew Bogott) [19:24:24] taavi fyi ssh change merged, ill leave the utils one till tomorro in case vol.ans has anything thanks [19:24:54] thanks! [19:25:12] np [19:25:23] (03PS1) 10Arturo Borrero Gonzalez: wmcs: disable dologmsg calls if cookbook is called in dry-mode [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761449 [19:26:12] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:26:42] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on cloudstore1008 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:27:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:01] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) [19:28:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:28:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] add new prometheus hosts to labs-in[4,6] [homer/public] - 10https://gerrit.wikimedia.org/r/761435 (https://phabricator.wikimedia.org/T301376) (owner: 10Herron) [19:28:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:27] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10RobH) [19:31:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: disable dologmsg calls if cookbook is called in dry-mode [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761449 (owner: 10Arturo Borrero Gonzalez) [19:31:16] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10RobH) [19:32:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:04] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.20/extensions/WikimediaEvents/includes/PrefUpdateInstrumentation.php: a307ac4b334dd6f60fa7257db10100e18531ee89: Track changes of growthexperiments-mentor-away-timestamp (T280307) (duration: 00m 50s) [19:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:08] T280307: Mentor dashboard: M2 mentor tools/settings - https://phabricator.wikimedia.org/T280307 [19:34:58] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.20/extensions/GrowthExperiments/: 9675848: 49202e7: Deploy M2 Mentor settings module (T280307) (duration: 00m 51s) [19:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:03] M2: Confirm MediaWiki Account Link - https://phabricator.wikimedia.org/M2 [19:35:22] TIL Phab has objects starting with an M [19:37:00] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.21/extensions/WikimediaEvents/: 588fa93: Track changes of growthexperiments-mentor-away-timestamp (T280307) (duration: 00m 49s) [19:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply on pinkunicorn [19:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Jclark-ctr) [19:38:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:38:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:21] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) [19:43:20] (03Merged) 10jenkins-bot: Mentor dashboard: Mark mentor-tools as beta [extensions/GrowthExperiments] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761101 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [19:43:24] finally [19:43:49] (03CR) 10Daniel Kinzler: [C: 03+1] "Looks good to me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/761433 (https://phabricator.wikimedia.org/T300129) (owner: 10Ppchelko) [19:45:11] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.21/extensions/GrowthExperiments/includes/Specials/SpecialMentorDashboard.php: 3da81ec: Mentor dashboard: Mark mentor-tools as beta (T280307) (duration: 00m 49s) [19:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:18] T280307: Mentor dashboard: M2 mentor tools/settings - https://phabricator.wikimedia.org/T280307 [19:45:44] !log UTC evening B&C window completed [19:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:50:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:03] (03CR) 10Cathal Mooney: Add Wikidough's IPv6 anycast network in esams (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [19:51:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:53] (Juniper alarm active) firing: Alert for device lsw3-codfw.mgmt.codfw.wmnet - Juniper alarm active - https://alerts.wikimedia.org [19:54:32] Stepping out for a bit. I'll be back in about an hour to roll the train forward if jeena hasn't made it back by then. 
[20:00:05] jeena and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T2000). [20:02:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Jclark-ctr) [20:06:32] RECOVERY - Check systemd state on snapshot1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:09:19] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, another response to John's point in line, although in general a parameter is better. The hard-coded v4 addressing is maybe preceden" [puppet] - 10https://gerrit.wikimedia.org/r/761372 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [20:10:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298554)', diff saved to https://phabricator.wikimedia.org/P20428 and previous config saved to /var/cache/conftool/dbconfig/20220209-201052-ladsgroup.json [20:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:59] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [20:12:22] (03PS2) 10Ssingh: bird: update vips_filter for Wikidough's IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/761372 (https://phabricator.wikimedia.org/T301165) [20:13:19] (03CR) 10Ssingh: "Thank you, both, for the review. I am happy to do a parameter; in which case I should probably update the v4 version as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/761372 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [20:13:39] (03CR) 10Ssingh: bird: update vips_filter for Wikidough's IPv6 address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761372 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [20:14:19] (03PS1) 10Herron: rsyslog: add 00-load_modules.conf [puppet] - 10https://gerrit.wikimedia.org/r/761455 (https://phabricator.wikimedia.org/T292175) [20:15:00] (03CR) 10jerkins-bot: [V: 04-1] rsyslog: add 00-load_modules.conf [puppet] - 10https://gerrit.wikimedia.org/r/761455 (https://phabricator.wikimedia.org/T292175) (owner: 10Herron) [20:16:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10Jclark-ctr) [20:17:34] (03PS2) 10Herron: rsyslog: add 00-load_modules.conf [puppet] - 10https://gerrit.wikimedia.org/r/761455 (https://phabricator.wikimedia.org/T292175) [20:25:39] (03PS2) 10Herron: remove references to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994) [20:25:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P20429 and previous config saved to /var/cache/conftool/dbconfig/20220209-202557-ladsgroup.json [20:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:31] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) BGP section established between QFX leaf and Dell spine 1 ` dell-spine1# show bgp ipv4 unicast summary BGP router identifier 10.0.1.13, local AS number 65030 Neighbor V AS MsgRcvd MsgSent InQ... [20:32:19] (03CR) 10Cathal Mooney: [C: 04-1] "I think we can abandon this one. 
Single BGP session exchanging both types of routes seems to be what the Bird templates will configure on" [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [20:35:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Jclark-ctr) [20:38:09] (03Abandoned) 10Ebernhardson: rdf-query-service: consistently suffix env vars [puppet] - 10https://gerrit.wikimedia.org/r/757996 (owner: 10Ebernhardson) [20:40:47] 10SRE, 10Beta-Cluster-Infrastructure: Enable search team SRE access to deployment-prep VMs - https://phabricator.wikimedia.org/T301408 (10bking) [20:41:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P20430 and previous config saved to /var/cache/conftool/dbconfig/20220209-204101-ladsgroup.json [20:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:10] (03PS2) 10Ebernhardson: Provide jwt secret to blazegraph for logging [puppet] - 10https://gerrit.wikimedia.org/r/761075 (https://phabricator.wikimedia.org/T293462) [20:41:46] will deploy to group1 now [20:42:11] (03CR) 10Andrew Bogott: [C: 03+2] nfs add_server: create service address with prefix rather than volume name [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761429 (owner: 10Andrew Bogott) [20:43:05] 10SRE, 10Beta-Cluster-Infrastructure: Enable search team SRE access to deployment-prep VMs - https://phabricator.wikimedia.org/T301408 (10bking) a:03bking [20:44:21] (03PS1) 10Jeena Huneidi: group1 wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761463 [20:44:23] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761463 (owner: 10Jeena Huneidi) [20:45:25] (03PS3) 10Herron: remove 
references to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994) [20:46:27] (03CR) 10Herron: remove references to centrallog2001 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994) (owner: 10Herron) [20:46:29] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761463 (owner: 10Jeena Huneidi) [20:47:42] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.21 refs T300197 [20:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:46] T300197: 1.38.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T300197 [20:48:34] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.21 refs T300197 (duration: 00m 51s) [20:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:10] (03PS1) 10Bking: deployment-prep: add search team SSH pub keys [labs/private] - 10https://gerrit.wikimedia.org/r/761465 (https://phabricator.wikimedia.org/T301408) [20:51:58] (03CR) 10Majavah: [C: 04-1] "This file applies to all of Cloud VPS (which I'm not sure if you want), all additions must follow https://wikitech.wikimedia.org/wiki/Help" [labs/private] - 10https://gerrit.wikimedia.org/r/761465 (https://phabricator.wikimedia.org/T301408) (owner: 10Bking) [20:52:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:07] (03CR) 10Cwhite: [C: 03+2] opensearch: use java_home from profile::java [puppet] - 10https://gerrit.wikimedia.org/r/759751 (https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [20:53:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:53:33] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:07] (03CR) 10BryanDavis: [C: 04-1] "This location makes these users Cloud VPS global roots, not just roots in the deployment-prep project. For the stated purpose, these ssh p" [labs/private] - 10https://gerrit.wikimedia.org/r/761465 (https://phabricator.wikimedia.org/T301408) (owner: 10Bking) [20:56:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298554)', diff saved to https://phabricator.wikimedia.org/P20431 and previous config saved to /var/cache/conftool/dbconfig/20220209-205606-ladsgroup.json [20:56:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [20:56:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [20:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:56:11] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [20:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:56:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298554)', diff saved to https://phabricator.wikimedia.org/P20432 and previous config saved to /var/cache/conftool/dbconfig/20220209-205619-ladsgroup.json [20:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:14] 10SRE, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Enable search team SRE access to deployment-prep VMs - https://phabricator.wikimedia.org/T301408 (10bd808) [20:59:36] (03CR) 10RhinosF1: deployment-prep: add search team SSH pub keys (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/761465 (https://phabricator.wikimedia.org/T301408) (owner: 10Bking) [21:00:05] jeena and dancy: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T2000). Please do the needful. [21:00:05] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220209T2100). [21:00:12] o/ [21:00:27] hi dancy [21:00:31] Does T301310 have any train implications? [21:00:31] T301310: CommonsMetadata extension is triggering a duplicate parse in commons - https://phabricator.wikimedia.org/T301310 [21:00:37] looking [21:00:43] problem reported about wmf.20. [21:01:50] (03CR) 10Majavah: [C: 04-1] deployment-prep: add search team SSH pub keys (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/761465 (https://phabricator.wikimedia.org/T301408) (owner: 10Bking) [21:01:54] My guess is no but I wanted to throw the question out for those who might be lurking (E.g. 
ladsgroup) [21:01:55] I've already deployed but doesn't seem like it should stop the train considering what would we be rolling back to? [21:02:11] also it hasn't been added as a train blocker [21:02:18] Carry on! [21:02:22] :P [21:02:40] thanks for checking in :) [21:02:54] Amir1: ^ [21:03:12] sup [21:03:26] let me read [21:03:36] just wanted to make sure you saw, I don't know if you highlight on "ladsgroup" or not :D [21:03:49] yeah yeah, it's wmf.20, I don't think it should block the train [21:04:00] I actually should highlight it [21:04:25] (not sure though, I think I used to do but then I got pinged constantly in #wikimedia-dev) [21:04:33] (03CR) 10RhinosF1: deployment-prep: add search team SSH pub keys (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/761465 (https://phabricator.wikimedia.org/T301408) (owner: 10Bking) [21:04:55] I'm trying to debug/bisect [21:05:01] thx jeez/rzl/Amir1! [21:05:04] *jeena [21:05:08] jeez [21:05:14] hehe [21:05:20] alright, slinking back into my corner. 
[21:06:14] :D [21:13:17] (03PS1) 10Andrew Bogott: profile::wmcs::nfs::standalone: add -nfs postfix to service domain [puppet] - 10https://gerrit.wikimedia.org/r/761472 (https://phabricator.wikimedia.org/T301280) [21:13:33] (03PS1) 10Cathal Mooney: Adding includes for Netbox-generated zone files for new eqiad subnets [dns] - 10https://gerrit.wikimedia.org/r/761473 (https://phabricator.wikimedia.org/T299758) [21:15:54] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::nfs::standalone: add -nfs postfix to service domain [puppet] - 10https://gerrit.wikimedia.org/r/761472 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [21:24:28] (03CR) 10Volans: "Optional comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/761397 (https://phabricator.wikimedia.org/T296533) (owner: 10Majavah) [21:26:56] 10ops-eqiad: Allocate new cabs for WMCS in rows E/F Eqiad - https://phabricator.wikimedia.org/T301414 (10cmooney) [21:45:15] (03CR) 10Cathal Mooney: "I was told you are probably the best person to vet this by someone in traffic Riccardo, hope that's ok!" [dns] - 10https://gerrit.wikimedia.org/r/761473 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [21:50:00] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:51:20] !log T299422: Started Wikibase rebuildItemsPerSite in 100k page batches on mwmaint1002 for wikidatawiki. Can be killed at any time, if necessary. 
[21:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:25] T299422: Identify historic true duplicates in Wikidata - https://phabricator.wikimedia.org/T299422 [21:56:43] (03CR) 10Cwhite: [C: 03+2] "PCC OK https://puppet-compiler.wmflabs.org/pcc-worker1002/33673/" [puppet] - 10https://gerrit.wikimedia.org/r/761026 (https://phabricator.wikimedia.org/T300997) (owner: 10Cwhite) [21:58:00] (03PS1) 10AOkoth: admin: add bmansurov to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/761477 (https://phabricator.wikimedia.org/T301215) [22:00:12] (03CR) 10RLazarus: [C: 03+1] admin: add bmansurov to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/761477 (https://phabricator.wikimedia.org/T301215) (owner: 10AOkoth) [22:00:28] (03CR) 10AOkoth: [C: 03+2] admin: add bmansurov to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/761477 (https://phabricator.wikimedia.org/T301215) (owner: 10AOkoth) [22:03:11] 10ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10cmooney) 05Open→03In progress p:05Triage→03Medium [22:03:32] 10ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10cmooney) 05In progress→03Open [22:04:28] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Give bmansurov access necessary to support Research Airflow jobs - https://phabricator.wikimedia.org/T301215 (10Arnoldokoth) @bmansurov This is sorted now. 
[22:05:15] SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (cmooney) [22:05:37] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Give bmansurov access necessary to support Research Airflow jobs - https://phabricator.wikimedia.org/T301215 (Arnoldokoth) In progress→Resolved a:Arnoldokoth [22:12:45] (PS1) MSantos: tegola: fix label cut on place_label layer [deployment-charts] - https://gerrit.wikimedia.org/r/761481 (https://phabricator.wikimedia.org/T228612) [22:20:46] ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (wiki_willy) a:Jclark-ctr [22:22:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298554)', diff saved to https://phabricator.wikimedia.org/P20434 and previous config saved to /var/cache/conftool/dbconfig/20220209-222231-ladsgroup.json [22:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:37] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [22:25:27] (PS1) Dzahn: test commit [debs/etherpad-lite] - https://gerrit.wikimedia.org/r/761485 [22:25:28] PROBLEM - Check systemd state on elastic1038 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:34] SRE, observability, serviceops: basic prometheus monitoring for PoolCounter - https://phabricator.wikimedia.org/T237407 (RLazarus) Open→Resolved [22:30:56] (CR) Ryan Kemper: [C: +2] elasticsearch: decom elastic10[32-47] (step 4) [puppet] - https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper) [22:34:54] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL:
/{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:37:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P20435 and previous config saved to /var/cache/conftool/dbconfig/20220209-223736-ladsgroup.json [22:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:07] !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts elastic[1032-1038,1040-1042,1044-1047].eqiad.wmnet [22:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:12] PROBLEM - Check systemd state on elastic1037 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P20437 and previous config saved to /var/cache/conftool/dbconfig/20220209-225240-ladsgroup.json [22:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:10] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:59:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [23:03:11] (PS4) Ebernhardson: [WIP] query_service: Simplify jvm arg handling [puppet] - https://gerrit.wikimedia.org/r/761080 [23:03:25]
(CR) Ebernhardson: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/761080 (owner: Ebernhardson) [23:04:26] (PS1) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/761491 [23:07:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298554)', diff saved to https://phabricator.wikimedia.org/P20438 and previous config saved to /var/cache/conftool/dbconfig/20220209-230745-ladsgroup.json [23:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:51] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [23:07:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [23:07:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [23:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [23:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [23:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:43] (PS2) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/761491 [23:13:59] (CR) Ahmon Dancy: [C: +2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/761491 (owner: Ahmon
Dancy) [23:14:44] (Merged) jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/761491 (owner: Ahmon Dancy) [23:15:55] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Jclark-ctr) [23:18:03] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts elastic[1032-1038,1040-1042,1044-1047].eqiad.wmnet [23:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:40] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (Jclark-ctr) [23:26:26] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:26:43] !log T294805 manual remediation for decom cookbook failure on elastic1046 as described in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_ANY_Opsen [23:26:44] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805 [23:33:29] SRE, Analytics-Radar, Traffic-Icebox, Patch-For-Review: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967 (Ladsgroup) Open→Declined It seems RFC 6265 actually brought back Expires (https:... [23:37:24] SRE, Observability-Logging, Wikimedia-Logstash, observability: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (colewhite) I think this can be closed: declined. With the migration to OpenSearch and the future of Curator unclear, I think t...
[23:39:11] SRE, Performance-Team, Security-Team, Security, user-sbassett: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (Mstyles) [23:48:24] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [23:48:30] !log apt1001 - delete etherpad-lite for bullseye source package, built, uploaded and imported 1.8.16-2 in bullseye-wikimedia, now source and binary packages in APT, simulated install on etherpad1003 works T300568 [23:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:35] T300568: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 [23:49:34] (Abandoned) Ladsgroup: varnish: Replace "Expires" in Set-Cookie with "Max-Age" [puppet] - https://gerrit.wikimedia.org/r/637851 (https://phabricator.wikimedia.org/T147967) (owner: Ladsgroup) [23:51:19] SRE, Performance-Team, Security-Team, Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (JJMC89) [23:53:53] (Juniper alarm active) firing: Alert for device lsw3-codfw.mgmt.codfw.wmnet - Juniper alarm active got acknowledged - https://alerts.wikimedia.org [23:54:17] (Abandoned) Dzahn: test commit [debs/etherpad-lite] - https://gerrit.wikimedia.org/r/761485 (owner: Dzahn)