[00:01:46] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:12:12] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:20] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:28] dduvall I have a patch for the zhwiki toc issue, don't know how you want to handle it
[00:20:38] cscott: thanks for getting that so quickly. I can't review but if it merges and we can wrangle an sre for Friday deployment, I'm willing to backport it and sync
[00:26:20] subbu is reviewing, I think
[00:51:21] +2ed
[00:54:09] dduvall: +2ed .. let me know if it is better deploying now or sat morning CST.
[00:56:18] subbu[m], cscott: now is better for me but let's see if we can find an sre
[00:56:26] ok.
[00:57:28] asking in -sre
[00:59:53] afk for 10.
[01:07:43] ok, zuul is still at it.
[01:07:47] legoktm: just waiting on gate-and-submit
[01:07:56] ack
[01:08:14] i should get the cherry-pick going in parallel i suppose
[01:08:44] (PS1) Dduvall: Regression fix: do language conversion on ToC in ParserOutput::getText() [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/737079 (https://phabricator.wikimedia.org/T295187)
[01:09:44] I'm here watching fwiw
[01:10:03] thank you
[01:12:01] (CR) Dduvall: [C: +2] Regression fix: do language conversion on ToC in ParserOutput::getText() [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/737079 (https://phabricator.wikimedia.org/T295187) (owner: Dduvall)
[01:13:08] oof. 21 min and counting for the master branch change
[01:13:36] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:14:00] there it goes. ok, now waiting on the backport
[01:20:26] cscott, subbu[m] will this be testable on mwdebug?
[01:25:19] I would think so, just need a snippet of the right wikitext to throw at api.php?action=parse
[01:27:20] ya .. cscott knows better.
[01:28:44] alright
[01:29:30] legoktm: i'm just now realizing i haven't done a backport deployment since before the mwdebug helmfile sync was setup. should i still scap pull on the server or run helmfile or... wait for the sync?
[01:30:15] the helm stuff is all automated, for now you can just ignore it and do scap pull/sync like you normally do
[01:30:24] got it, ok
[01:31:07] once the backport merges, it kicks off a new image build, the auto-deploy script sees the new image version, and then immediately helmfile deploys it, regardless of what the git/scap state is.
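Editorial sketch of the api.php?action=parse check suggested at 01:25:19. It only builds the request; nothing is sent. The helper name `build_parse_request` is hypothetical, and the `variant` parameter, the `X-Wikimedia-Debug: backend=...` header format, and the `mwdebug1002.eqiad.wmnet` hostname are assumptions to verify against the API help and the mwdebug documentation before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_parse_request(wiki_host, wikitext, variant, debug_backend):
    """Build (but do not send) an action=parse request routed to a debug host.

    Hypothetical helper for illustration only.
    """
    params = {
        "action": "parse",
        "format": "json",
        "contentmodel": "wikitext",
        "prop": "text",          # ask for the rendered HTML, ToC included
        "text": wikitext,
        "variant": variant,      # assumed: language variant to convert to
    }
    url = f"https://{wiki_host}/w/api.php?{urlencode(params)}"
    # Assumed header format for pinning the request to one mwdebug backend.
    headers = {"X-Wikimedia-Debug": f"backend={debug_backend}"}
    return Request(url, headers=headers)


if __name__ == "__main__":
    req = build_parse_request(
        "crh.wikipedia.org",
        "__FORCETOC__\n== Birleşken Milletler ==",
        "crh-cyrl",
        "mwdebug1002.eqiad.wmnet",
    )
    print(req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; the ToC in the returned HTML could then be compared with and without the debug header.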
[01:31:45] (Merged) jenkins-bot: Regression fix: do language conversion on ToC in ParserOutput::getText() [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/737079 (https://phabricator.wikimedia.org/T295187) (owner: Dduvall)
[01:32:23] legoktm: ack. thank you
[01:33:58] !log performing emergency backport deployment of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/737079
[01:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:13] "I would think so, just need a..." <- Depends on whether language converter is enabled, that's probably the tricky part.
[01:35:45] once it's on mwdebug1001, you can just navigate to zhwiki and try the patch there
[01:36:34] Crhwiki would be easier, Cyrillic is a lot easier to distinguish
[01:36:52] cscott, subbu[m] now on mwdebug1002
[01:36:56] Let me make sure my mwdebug extension is set up and working
[01:37:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:13] all wikis are on wmf.7, so you have your pick :)
[01:37:30] :)
[01:37:42] ok ...
[01:38:02] we should probably have $wgUsePigLatinVariant = true; on a testwiki
[01:38:55] Ok it works on crhwiki on mw1002
[01:39:11] https://crh.wikipedia.org/w/index.php?title=Birle%C5%9Fken_Milletler_Te%C5%9Fkil%C3%A2t%C4%B1&variant=crh-cyrl
[01:39:27] \o/
[01:39:33] ToC in cyrillic when you hit that from mwdebug1002
[01:39:58] i'll stare hard at zhwiki now and see if I can see the ToC characters change shape too :)
[01:40:19] ok. just give me the go ahead when you're done
[01:40:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:32] yes, i can see some characters change on https://zh.wikipedia.org/zh-sg/%E4%BA%9E%E9%A6%AC%E9%81%9C%E7%9B%86%E5%9C%B0 as well. so thumbs up.
[01:41:54] right on. thank you
[01:41:56] (first character in TOC item #3 on that page)
[01:42:34] !log emergency backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/737079 deployed and verified on mwdebug1002. syncing to all targets
[01:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:43] !log dduvall@deploy1002 Synchronized php-1.38.0-wmf.7/includes/parser/ParserOutput.php: Backport: [[gerrit:737079|Regression fix: do language conversion on ToC in ParserOutput::getText() (T295187)]] (duration: 00m 56s)
[01:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:46] T295187: Chinese conversion no longer work in the table of content - https://phabricator.wikimedia.org/T295187
[01:45:13] cscott: it's everywhere
[01:45:58] yup, looks good now even with mwdebug off
[01:46:42] and I still see TOC on other articles where it's supposed to be there
[01:46:49] thanks everyone!
[01:46:56] t-shirts for everyone :)
[01:47:06] yay! big thanks
[01:47:07] for next time, we should figure out how to deal with this parsercache purge issue.
[01:47:11] thanks all for the late friday fire drill
[01:47:17] time to finally open that high ABV IPA that awaits me on fridays
[01:47:27] and on monday i want to write proper parser tests to catch issues like this in the future
[01:47:44] happy weekend :)
[01:47:55] you too! :)
[01:47:56] signing off
[01:48:19] g'night
[01:57:25] The next badge target is to do it whilst on a plane >.>
[01:57:56] :D
[03:16:40] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 261 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:22:44] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 45 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:25:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:06] PROBLEM - MariaDB Replica Lag: m1 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 613.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:43:10] RECOVERY - MariaDB Replica Lag: m1 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211106T0700)
[08:38:58] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:42:37] (CR) Elukey: [C: +2] Add comment about Druid data retention for webrequest_sampled_128 [puppet] - https://gerrit.wikimedia.org/r/737097 (owner: Elukey)
[09:37:38] (PS11) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[09:38:12] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:40:19] (PS12) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[09:40:33] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[09:47:10] Puppet, Infrastructure-Foundations, Project-Admins, PM: Clarify Puppet tag - https://phabricator.wikimedia.org/T295221 (Majavah)
[11:19:01] (PS13) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[11:24:55] (CR) Majavah: "Testing this on deployment-prep currently fails with:" [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[11:42:00] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:03:09] (PS11) JMeybohm: Add Jetstack's cert-manager (v1.5.4) images [docker-images/production-images] - https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T294560) (owner: Elukey)
[12:09:18] (PS12) JMeybohm: Add Jetstack's cert-manager (v1.5.4) images [docker-images/production-images] - https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T294560) (owner: Elukey)
[12:13:12] (PS1) JMeybohm: Import chart cert-manager v1.5.4 [deployment-charts] - https://gerrit.wikimedia.org/r/737167 (https://phabricator.wikimedia.org/T294560)
[13:19:23] Puppet, Infrastructure-Foundations, Project-Admins, PM: Clarify Puppet tag - https://phabricator.wikimedia.org/T295221 (Aklapper) CC'ing @joanna_borun for input. Some historical context: {T285143}; {T84868}; {T127556}
[14:06:03] (PS1) JMeybohm: Add cfssl-issuer and cfssl-issuer-crds chart [deployment-charts] - https://gerrit.wikimedia.org/r/737169 (https://phabricator.wikimedia.org/T294560)
[14:06:29] (CR) jerkins-bot: [V: -1] Add cfssl-issuer and cfssl-issuer-crds chart [deployment-charts] - https://gerrit.wikimedia.org/r/737169 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[14:52:56] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:45:19] (CR) Awight: Variant configuration: Allow for YAML-based inheritance of configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/538129 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[18:45:56] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 104, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:57:30] (PS1) Majavah: dynamicproxy: Drop python 2 redis client [puppet] - https://gerrit.wikimedia.org/r/737173 (https://phabricator.wikimedia.org/T295235)
[19:15:34] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:15:58] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:50:56] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:12:46] PROBLEM - SSH on kubernetes1003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:13:46] RECOVERY - SSH on kubernetes1003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:43:38] PROBLEM - snapshot of s4 in eqiad on alert1001 is CRITICAL: snapshot for s4 at eqiad taken more than 3 days ago: Most recent backup 2021-11-03 21:22:20 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:52:16] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:54:22] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring