[00:01:46] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:12:12] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:20] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:28] <cscott>	 dduvall I have a patch for the zhwiki toc issue, don't know how you want to handle it
[00:20:38] <dduvall>	 cscott: thanks for getting that so quickly. I can't review but if it merges and we can wrangle an sre for Friday deployment, I'm willing to backport it and sync
[00:26:20] <cscott>	 subbu is reviewing, I think
[00:51:21] <subbu[m]>	 +2ed 
[00:54:09] <subbu[m]>	 dduvall: +2ed .. let me know if it is better deploying now or sat morning CST.
[00:56:18] <dduvall>	 subbu[m], cscott: now is better for me but let's see if we can find an sre
[00:56:26] <subbu[m]>	 ok.
[00:57:28] <dduvall>	 asking in -sre
[00:59:53] <subbu[m]>	 afk for 10.
[01:07:43] <subbu[m]>	 ok, zuul is still at it.
[01:07:47] <dduvall>	 legoktm: just waiting on gate-and-submit
[01:07:56] <legoktm>	 ack
[01:08:14] <dduvall>	 i should get the cherry-pick going in parallel i suppose
[01:08:44] <wikibugs>	 (PS1) Dduvall: Regression fix: do language conversion on ToC in ParserOutput::getText() [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/737079 (https://phabricator.wikimedia.org/T295187)
[01:09:44] <cscott>	 I'm here watching fwiw
[01:10:03] <dduvall>	 thank you
[01:12:01] <wikibugs>	 (CR) Dduvall: [C: +2] Regression fix: do language conversion on ToC in ParserOutput::getText() [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/737079 (https://phabricator.wikimedia.org/T295187) (owner: Dduvall)
[01:13:08] <dduvall>	 oof. 21 min and counting for the master branch change
[01:13:36] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:14:00] <dduvall>	 there it goes. ok, now waiting on the backport
[01:20:26] <dduvall>	 cscott, subbu[m] will this be testable on mwdebug?
[01:25:19] <legoktm>	 I would think so, just need a snippet of the right wikitext to throw at api.php?action=parse
[01:27:20] <subbu[m]>	 ya .. cscott knows better. 
[01:28:44] <dduvall>	 alright
[01:29:30] <dduvall>	 legoktm: i'm just now realizing i haven't done a backport deployment since before the mwdebug helmfile sync was setup. should i still scap pull on the server or run helmfile or... wait for the sync?
[01:30:11] <dduvall>	 </embarrassing question>
[01:30:15] <legoktm>	 the helm stuff is all automated, for now you can just ignore it and do scap pull/sync like you normally do
[01:30:24] <dduvall>	 got it, ok
[01:31:07] <legoktm>	 once the backport merges, it kicks off a new image build, the auto-deploy script sees the new image version, and then immediately helmfile deploys it, regardless of what the git/scap state is.
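[editor's note] The auto-deploy behaviour legoktm describes above (a new image version triggers a helmfile sync, independent of git/scap state) can be sketched roughly as follows. This is only an illustration of the decision, not the actual script: the function name, the image tags, and the per-datacenter mapping are all hypothetical.

```python
def datacenters_needing_sync(latest_image, deployed_by_dc):
    """Return the datacenters whose deployed image differs from the latest build.

    Hypothetical helper: the decision depends only on image versions,
    mirroring the point that the git/scap state is not consulted.
    """
    return [dc for dc, tag in deployed_by_dc.items() if tag != latest_image]


# Hypothetical tags: the backport merge produced "build-2"; both sites still run "build-1".
pending = datacenters_needing_sync("build-2", {"eqiad": "build-1", "codfw": "build-1"})
for dc in pending:
    # A real deployer would run a helmfile sync for this datacenter here.
    print(f"would sync mwdebug in {dc}")
```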
[01:31:45] <wikibugs>	 (Merged) jenkins-bot: Regression fix: do language conversion on ToC in ParserOutput::getText() [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/737079 (https://phabricator.wikimedia.org/T295187) (owner: Dduvall)
[01:32:23] <dduvall>	 legoktm: ack. thank you
[01:33:58] <dduvall>	 !log performing emergency backport deployment of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/737079
[01:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:13] <cscott>	 <legoktm> "I would think so, just need a..." <- Depends on whether language converter is enabled, that's probably the tricky part.
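[editor's note] The api.php?action=parse check legoktm suggested, with the language-converter caveat cscott raises, might look something like the sketch below. It only builds the request and sends nothing. Assumptions not stated in the log: the `variant` API parameter selects the converter variant, the `X-Wikimedia-Debug` header routes the request to a debug backend, and the backend's full hostname is `mwdebug1002.eqiad.wmnet`. The sample wikitext is made up.

```python
from urllib.parse import urlencode


def build_parse_request(wiki_host, wikitext, variant, debug_backend=None):
    """Return (url, headers) for an action=parse request on the given wiki."""
    params = {
        "action": "parse",
        "format": "json",
        "contentmodel": "wikitext",
        "prop": "text",
        "text": wikitext,
        "variant": variant,  # assumed: selects the language-converter variant
    }
    headers = {}
    if debug_backend is not None:
        # Assumed header format for routing the request to a debug server.
        headers["X-Wikimedia-Debug"] = f"backend={debug_backend}"
    return f"https://{wiki_host}/w/api.php?{urlencode(params)}", headers


# Made-up wikitext with enough headings that a ToC should be rendered.
sample = "== A ==\n== B ==\n== C ==\n== D =="
url, headers = build_parse_request(
    "crh.wikipedia.org", sample, "crh-cyrl", "mwdebug1002.eqiad.wmnet"
)
print(url)
print(headers)
```

The check only means something on a wiki where the language converter is enabled, which is why crhwiki and zhwiki are the candidates discussed here.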
[01:35:45] <legoktm>	 once it's on mwdebug1001, you can just navigate to zhwiki and try the patch there
[01:36:34] <cscott>	 Crhwiki would be easier, Cyrillic is a lot easier to distinguish
[01:36:52] <dduvall>	 cscott, subbu[m] now on mwdebug1002
[01:36:56] <cscott>	 Let me make sure my mwdebug extension is set up and working
[01:37:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:13] <legoktm>	 all wikis are on wmf.7, so you have your pick :)
[01:37:30] <dduvall>	 :)
[01:37:42] <subbu[m]>	 ok ...
[01:38:02] <AntiComposite>	 we should probably have $wgUsePigLatinVariant = true; on a testwiki
[01:38:55] <cscott>	 Ok it works on crhwiki on mwdebug1002
[01:39:11] <cscott>	 https://crh.wikipedia.org/w/index.php?title=Birle%C5%9Fken_Milletler_Te%C5%9Fkil%C3%A2t%C4%B1&variant=crh-cyrl
[01:39:27] <dduvall>	 \o/
[01:39:33] <cscott>	 ToC in cyrillic when you hit that from mwdebug1002
[01:39:58] <cscott>	 i'll stare hard at zhwiki now and see if I can see the ToC characters change shape too :)
[01:40:19] <dduvall>	 ok. just give me the go ahead when you're done
[01:40:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:40:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:32] <cscott>	 yes, i can see some characters change on https://zh.wikipedia.org/zh-sg/%E4%BA%9E%E9%A6%AC%E9%81%9C%E7%9B%86%E5%9C%B0 as well.  so thumbs up.
[01:41:54] <dduvall>	 right on. thank you
[01:41:56] <cscott>	 (first character in TOC item #3 on that page)
[01:42:34] <dduvall>	 !log emergency backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/737079 deployed and verified on mwdebug1002. syncing to all targets
[01:42:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:43] <logmsgbot>	 !log dduvall@deploy1002 Synchronized php-1.38.0-wmf.7/includes/parser/ParserOutput.php: Backport: [[gerrit:737079|Regression fix: do language conversion on ToC in ParserOutput::getText() (T295187)]] (duration: 00m 56s)
[01:43:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:46] <stashbot>	 T295187: Chinese conversion no longer work in the table of content - https://phabricator.wikimedia.org/T295187
[01:45:13] <dduvall>	 cscott: it's everywhere
[01:45:58] <cscott>	 yup, looks good now even with mwdebug off
[01:46:42] <legoktm>	 and I still see TOC on other articles where it's supposed to be there
[01:46:49] <subbu[m]>	 thanks everyone!
[01:46:56] <cscott>	 t-shirts for everyone :)
[01:47:06] <dduvall>	 yay! big thanks
[01:47:07] <subbu[m]>	 for next time, we should figure out how to deal with this parsercache purge issue.
[01:47:11] <cscott>	 thanks all for the late friday fire drill
[01:47:17] <dduvall>	 time to finally open that high ABV IPA that awaits me on fridays
[01:47:27] <cscott>	 and on monday i want to write proper parser tests to catch issues like this in the future
[01:47:44] <legoktm>	 happy weekend :)
[01:47:55] <dduvall>	 you too! :)
[01:47:56] <subbu[m]>	 signing off
[01:48:19] <cscott>	 g'night
[01:57:25] <p858snake>	 The next badge target is to do it whilst on a plane >.>
[01:57:56] <dduvall>	 :D
[03:16:40] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 261 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:22:44] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 45 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:25:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:06] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: m1 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 613.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:43:10] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: m1 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211106T0700)
[08:38:58] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:42:37] <wikibugs>	 (CR) Elukey: [C: +2] Add comment about Druid data retention for webrequest_sampled_128 [puppet] - https://gerrit.wikimedia.org/r/737097 (owner: Elukey)
[09:37:38] <wikibugs>	 (PS11) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[09:38:12] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:40:19] <wikibugs>	 (PS12) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[09:40:33] <wikibugs>	 (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[09:47:10] <wikibugs>	 Puppet, Infrastructure-Foundations, Project-Admins, PM: Clarify Puppet tag - https://phabricator.wikimedia.org/T295221 (Majavah)
[11:19:01] <wikibugs>	 (PS13) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[11:24:55] <wikibugs>	 (CR) Majavah: "Testing this on deployment-prep currently fails with:" [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[11:42:00] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:03:09] <wikibugs>	 (PS11) JMeybohm: Add Jetstack's cert-manager (v1.5.4) images [docker-images/production-images] - https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T294560) (owner: Elukey)
[12:09:18] <wikibugs>	 (PS12) JMeybohm: Add Jetstack's cert-manager (v1.5.4) images [docker-images/production-images] - https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T294560) (owner: Elukey)
[12:13:12] <wikibugs>	 (PS1) JMeybohm: Import chart cert-manager v1.5.4 [deployment-charts] - https://gerrit.wikimedia.org/r/737167 (https://phabricator.wikimedia.org/T294560)
[13:19:23] <wikibugs>	 Puppet, Infrastructure-Foundations, Project-Admins, PM: Clarify Puppet tag - https://phabricator.wikimedia.org/T295221 (Aklapper) CC'ing @joanna_borun for input.  Some historical context: {T285143}; {T84868}; {T127556}
[14:06:03] <wikibugs>	 (PS1) JMeybohm: Add cfssl-issuer and cfssl-issuer-crds chart [deployment-charts] - https://gerrit.wikimedia.org/r/737169 (https://phabricator.wikimedia.org/T294560)
[14:06:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add cfssl-issuer and cfssl-issuer-crds chart [deployment-charts] - https://gerrit.wikimedia.org/r/737169 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[14:52:56] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:45:19] <wikibugs>	 (CR) Awight: Variant configuration: Allow for YAML-based inheritance of configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/538129 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[18:45:56] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 104, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:57:30] <wikibugs>	 (PS1) Majavah: dynamicproxy: Drop python 2 redis client [puppet] - https://gerrit.wikimedia.org/r/737173 (https://phabricator.wikimedia.org/T295235)
[19:15:34] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:15:58] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:50:56] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:12:46] <icinga-wm>	 PROBLEM - SSH on kubernetes1003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:13:46] <icinga-wm>	 RECOVERY - SSH on kubernetes1003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:43:38] <icinga-wm>	 PROBLEM - snapshot of s4 in eqiad on alert1001 is CRITICAL: snapshot for s4 at eqiad taken more than 3 days ago: Most recent backup 2021-11-03 21:22:20 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:52:16] <icinga-wm>	 PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:54:22] <icinga-wm>	 RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring