User Details
- User Since: Jul 23 2024, 9:16 AM (44 w, 17 h)
- Availability: Available
- IRC Nick: tappof
- LDAP User: Tiziano Fogli
- MediaWiki User: Tiziano Fogli
Yesterday
SRE business hours escalation reset to: escalate to Americas/EMEA; if unacked for 5 minutes, escalate to batphone
Overrides, in case they are reset by the batphone change
Mon, May 26
Thank you @andrea.denisse for your feedback.
I've just updated the task description, apologies for the previous version, it was drafted and pushed in a hurry.
Regarding the 15-minute delay, I took that number from here, and it seems to represent the maximum time for a Ganeti-to-Netbox sync.
Ok, thank you @RobH. I’ll add some Pint directives to silence alerts for missing metrics in the DCs that don’t have Pro4X PDUs installed, and then I’ll go ahead and merge the patch today.
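For reference, a minimal sketch of where such a directive sits (the group, alert, metric, and threshold below are hypothetical placeholders, not the rule actually being merged); a "# pint disable promql/series" comment tells pint not to flag metrics that are absent from a given Prometheus server:

groups:
  - name: pdu_breakers_example                # hypothetical group name
    rules:
      # pint disable promql/series
      - alert: PDUBreakerCurrentHigh          # hypothetical alert name
        # Pro4X metrics only exist in the DCs where that PDU model is installed,
        # so the missing-series check is disabled for this rule.
        expr: pdu_breaker_current_amperes > 16  # placeholder metric and threshold
        for: 5m
        labels:
          severity: warning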
Fri, May 23
# add prometheus7002 to manifests/site.pp
node /^prometheus[3456]00[1-9]\.(esams|ulsfo|eqsin|drmrs)\./ {
    role(prometheus::pop)
}
Sure @RobH, nothing will catch fire because of this patch (or maybe it will, since we’re talking about electric current) :) Thanks for taking a look.
Thu, May 22
Hi @wiki_willy,
based on the breaker alerts currently configured for the Sentry4 model, I’ve set up the same alert for the Pro4X model directly in Prometheus/Alertmanager.
I’m looking for someone to review the patch on Gerrit — ideally with an eye on the electrical aspects of the setup, as I lack domain-specific knowledge in that area.
Thank you
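To illustrate the "same alert, second PDU model" idea, a heavily hedged sketch: the metric names and threshold below are invented placeholders (the real ones come from the snmp_exporter modules for the Sentry4 and the Raritan Pro4X), and this is not the patch under review.

groups:
  - name: pdu_breaker_alerts_example               # hypothetical names and threshold
    rules:
      - alert: Sentry4BreakerCurrentHigh
        expr: sentry4_breaker_current_amperes > 16  # placeholder metric
        for: 5m
        labels:
          severity: warning
      - alert: Pro4XBreakerCurrentHigh
        # Same condition, pointed at the metric exposed by the Pro4X (Raritan) PDUs.
        expr: pro4x_breaker_current_amperes > 16    # placeholder metric
        for: 5m
        labels:
          severity: warning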
Mon, May 19
The issue happened again.
May 15 11:54:57 alert1002 ircecho[1319087]: Error writing: %sDropping this message: "%s"
May 15 11:54:57 alert1002 ircecho[1319087]: Exception in thread Thread-11:
May 15 11:54:57 alert1002 ircecho[1319087]: Traceback (most recent call last):
May 15 11:54:57 alert1002 ircecho[1319087]: File "/usr/local/bin/ircecho", line 187, in process_IN_MODIFY
May 15 11:54:57 alert1002 ircecho[1319087]: bot.connection.privmsg(chans, out)
May 15 11:54:57 alert1002 ircecho[1319087]: File "/usr/lib/python3/dist-packages/irc/client.py", line 854, in privmsg
May 15 11:54:57 alert1002 ircecho[1319087]: self.send_raw("PRIVMSG %s :%s" % (target, text))
May 15 11:54:57 alert1002 ircecho[1319087]: File "/usr/lib/python3/dist-packages/irc/client.py", line 884, in send_raw
May 15 11:54:57 alert1002 ircecho[1319087]: raise ServerNotConnectedError("Not connected.")
May 15 11:54:57 alert1002 ircecho[1319087]: irc.client.ServerNotConnectedError: Not connected.
May 15 11:54:57 alert1002 ircecho[1319087]: During handling of the above exception, another exception occurred:
May 15 11:54:57 alert1002 ircecho[1319087]: Traceback (most recent call last):
May 15 11:54:57 alert1002 ircecho[1319087]: File "/usr/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
May 15 11:54:57 alert1002 ircecho[1319087]: self.run()
May 15 11:54:57 alert1002 ircecho[1319087]: File "/usr/local/bin/ircecho", line 50, in run
May 15 11:54:57 alert1002 ircecho[1319087]: self.notifier.loop()
May 15 11:54:57 alert1002 ircecho[1319087]: File "/usr/lib/python3/dist-packages/pyinotify.py", line 1376, in loop
May 15 11:54:57 alert1002 ircecho[1319087]: self.process_events()
May 15 11:54:57 alert1002 ircecho[1319087]: File "/usr/lib/python3/dist-packages/pyinotify.py", line 1275, in process_events
May 15 11:54:57 alert1002 ircecho[1319087]: self._default_proc_fun(revent)
May 15 11:54:57 alert1002 ircecho[1319087]: File "/usr/lib/python3/dist-packages/pyinotify.py", line 910, in __call__
May 15 11:54:57 alert1002 ircecho[1319087]: return _ProcessEvent.__call__(self, event)
May 15 11:54:57 alert1002 ircecho[1319087]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
May 15 11:54:57 alert1002 ircecho[1319087]: File "/usr/lib/python3/dist-packages/pyinotify.py", line 630, in __call__
May 15 11:54:57 alert1002 ircecho[1319087]: return meth(event)
May 15 11:54:57 alert1002 ircecho[1319087]: ^^^^^^^^^^^
May 15 11:54:57 alert1002 ircecho[1319087]: File "/usr/local/bin/ircecho", line 190, in process_IN_MODIFY
May 15 11:54:57 alert1002 ircecho[1319087]: print('Error writing: %s'
May 15 11:54:57 alert1002 ircecho[1319087]: TypeError: unsupported operand type(s) for %: 'NoneType' and 'tuple'
May 15 11:56:04 alert1002 ircecho[1319087]: Connected
Mon, May 12
Above the list of all the wikis and their respective 'x', you'll see a list with checkboxes to select or deselect each entry. Did you try using that?
Tue, May 6
I merged the patch for T387866: Configure PDU monitoring resources through NetBox; the PDUs located in drmrs are now split by rack.
Mon, May 5
Fri, May 2
Hi @cmassaro,
It looks like the "Requested group membership" field is missing from your form.
Could you please let us know which group(s) you need to be added to?
Thanks!
@Ospingou, could you please sign off on the access request? Thank you!
Wed, Apr 30
@wiki_willy, please take a look at T387866: Configure PDU monitoring resources through NetBox. This will change how the row label is set and will also fix the splitting issue with the drmrs PDUs.
The only concern I have with dropping metrics based on a given label is that the _sum and _count values will no longer reflect the actual buckets stored in the TSDB.
This may not matter in our case, but it's something to keep in mind.
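A minimal sketch of the scenario (job, metric, and label names are hypothetical): dropping only the bucket series that match a label value leaves the corresponding _sum/_count series behind, so they no longer agree with the buckets actually kept in the TSDB.

scrape_configs:
  - job_name: example                     # hypothetical job name
    static_configs:
      - targets: ['localhost:9100']
    metric_relabel_configs:
      # Drop only the bucket series for one handler; the matching _sum and
      # _count series are still ingested and no longer match the stored buckets.
      - source_labels: [__name__, handler]
        regex: 'http_request_duration_seconds_bucket;/healthz'
        action: drop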
Merging the patch https://gerrit.wikimedia.org/r/1135022 will result in the following change to the Prometheus configs:
Hi @Madalina,
While I was checking, I noticed that you've already been added to the group.
Tue, Apr 29
Actually, they're defined in Puppet like this:
Mon, Apr 28
@wiki_willy, I was able to split the PDUs in a 'per row' manner. If you're looking at a PoP, this is equivalent to 'per rack'. Only in the two main DCs is there a real distinction between rack and row.
Otherwise, I can split them 'per instance', meaning one graph per PDU.
Let me know what you prefer.
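To make the two options concrete, a sketch with placeholder metric and label names (whether this is done via recording rules or directly in the Grafana panel queries): aggregating by row collapses every PDU in a row into one series, while grouping by instance keeps one series per PDU.

groups:
  - name: pdu_aggregation_example                 # hypothetical names throughout
    rules:
      - record: pdu:power_watts:sum_by_row        # 'per row' (== 'per rack' at a PoP)
        expr: sum by (site, row) (pdu_power_watts)
      - record: pdu:power_watts:sum_by_instance   # 'per instance', one graph per PDU
        expr: sum by (site, instance) (pdu_power_watts)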
Hey @wiki_willy, thanks for the feedback! I'll take a look at your request and let you know.
Apr 18 2025
@wiki_willy I've just finished updating the dashboard to include the information scraped from Magru's PDU.
Please have a look and let me know if anything doesn't look right.
Apr 16 2025
Apr 9 2025
You're free to close this task, since all the checks have been migrated to the corresponding HTTP version, as per T388419: cannot ping ripe-atlas-codfw from the codfw prometheus instance, and no alerts have been triggered.
Thank you.
Apr 8 2025
07:53:41 phedenskog │ Great, I'll do the alerts later today and send them for review. Thank you! For T390166 let us chat about that first before you start so I can show you what we use today with the old setup.
08:50:17 │ [tappof back: gone 13:46:16]
09:04:52 tappof │ hey good morning
09:06:02 tappof │ If you'd like to chat about what you'd like to speed up in your dashboard, I'm available
10:33:42 phedenskog │ Hello! Ok. So I think there's two use cases for speeding up queries. One is if it's possible to make a dashboard like https://grafana.wikimedia.org/d/rum/real-user-monitoring?orgId=1 faster? There's so many different tags there that you can change (and many different graphs). Do you think that is doable?
10:37:08 phedenskog │ The other way would be to choose a couple of metrics and tags and make it possible to look at data over long time. We used dashboards like https://grafana.wikimedia.org/d/000000143/navtiming-overall-history?orgId=1 where we look at a metric and can see trends over year(s).
10:37:35 phedenskog │ However as it is now I think the most important is the alerts and that we can look back at data one month and that work now.
10:55:49 tappof │ So, the widgets in this one (https://grafana-rw.wikimedia.org/d/000000143/navtiming-overall-history?orgId=1) still need to be converted to the corresponding Prometheus expressions. I think we can come back to it once they've been migrated. Are you ok with that?
11:01:49 phedenskog │ Yes I don't think we should do anything now and I think we will drop viewing of data like that. I think since it's not decided organisational who/which team is responsible for those dashboards, it makes sense to not spend time on making them/speeding up data. If later on it is decided, then that person/team can decide what to do.
11:03:24 phedenskog │ For some of the old Graphite dashboards: I plan not to remove them, but save them with a timespan so that we can use them to see back in time, and then I'll change the automatic message your team added about the deprecation. Is that ok? So we keep that dashboard as long as we still have Graphite as read only.
11:05:17 tappof │ Yeah, I think it's okay. I'll also talk to the team about adding a tag to every dashboard that needs to survive the sunsetting, due to historical data consumption.
11:11:45 tappof │ About this one: https://grafana-rw.wikimedia.org/d/rum/real-user-monitoring?orgId=1&forceLogin&editPanel=25 — it might be challenging since you're applying a lot of filters. I'll take a look. First of all, I think we'll need to remove the timespan setting, since it's hardcoded into the recording rule.
11:11:54 tappof │ Instead, to smooth the curves a bit, we should use an avg_over_time. I just checked, and even when pre-filtering with recording rules only on country, continent, browser, platform, and group, we still end up with around 11k series... still too many.
11:14:48 tappof │ I'll think about that and how to rearrange the dashboard. It might be necessary to split it into multiple dashboards.
11:14:56 tappof │ A question: are you used to querying this kind of data on a 'per-country' basis?
11:24:24 tappof │ Filtering only by geo_continent, mw_skin, and ua_family will reduce the cardinality to 1,800 series. While that's still a lot, it's much more manageable and acceptable.
11:43:46 phedenskog │ No, the per country is not used today I think. The idea was that potentially when we had focus countries for example India, we could have alerts to that country. What we used it in the past, is that we oversampled some countries and then looked at that data. Today we take 1 request out of 100 and beacon back that data, but we have the functionality to send for example 100% of data for users that access from India. But we don't use that today (since there's no team that work with it).
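A sketch of the kind of pre-aggregation discussed above; the metric name is a placeholder, while the label set (geo_continent, mw_skin, ua_family) is the one mentioned in the chat as bringing cardinality down to roughly 1,800 series.

groups:
  - name: rum_navtiming_example          # hypothetical group and metric names
    rules:
      - record: navtiming:responsestart_seconds_bucket:rate5m
        expr: |
          sum by (geo_continent, mw_skin, ua_family, le) (
            rate(navtiming_responsestart_seconds_bucket[5m])
          )
# A dashboard panel could then smooth the curve with something like
# avg_over_time(navtiming:responsestart_seconds_bucket:rate5m[1h]).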
Apr 7 2025
The 'init', 'total', and 'timePerInputKB' metrics haven't received data since 27/10/2023.
I’ve just finished updating the dashboard: all widgets that previously relied on high-cardinality metrics are now based on the corresponding recording rules.
Mar 27 2025
Dashboard migrated:
https://grafana.wikimedia.org/goto/thS_QhTNg?orgId=1
Mar 25 2025
Mar 19 2025
Opened an issue on the pyrra-dev/pyrra GitHub repository:
Mar 18 2025
Mar 17 2025
It looks like the patch has fixed the problem.
Mar 13 2025
I was implementing the checks as decided. Before proceeding, I manually ran a telnet from the proxies and found that they are unable to reach the anchors.
Wouldn't stripping down labels without aggregating data lead to losing data?
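To illustrate the question (job, metric, and label names hypothetical): stripping a label at scrape time makes series that differed only by that label collide, so samples are rejected or overwritten, whereas an aggregating rule combines them explicitly.

scrape_configs:
  - job_name: example                # hypothetical job and label names
    static_configs:
      - targets: ['localhost:9100']
    metric_relabel_configs:
      # Removing the label outright: series that differed only by 'handler'
      # become duplicates after relabeling, and data is effectively lost.
      - action: labeldrop
        regex: handler
# An aggregating recording rule instead keeps the combined signal, e.g.
# sum without (handler) (rate(http_requests_total[5m])).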
Mar 12 2025
Mar 11 2025
I'm reopening the task just to wait for feedback on the dashboard review.
This is a proposal for migrating the wt2html dashboard: https://grafana.wikimedia.org/goto/aPt4onhNR?orgId=1. Some widgets have (???) in their titles and need review.
During the activities for T371087, the original retention time of 730h was adjusted to the standard value of 4032h.
In agreement with @fgiunchedi, we restored the original value.
Before addressing Filippo's suggestions, I checked the situation in Esams, where prometheus7001 is able to perform ICMP tests against the anchor in the same data center.
Mar 10 2025
irc logs:
11:15:30 tappof │ arturo: dcaro Just a heads-up in case of any unwanted alerts: I'm merging this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100819
11:15:59 arturo │ tappof: thanks 🚢🇮🇹
11:15:59 dcaro │ tappof: ack!
11:52:06 arturo │ tappof: indeed we just had the alert firing
11:52:21 arturo │ https://usercontent.irccloud-cdn.com/file/SlS4y6la/image.png
11:52:34 arturo │ and T388379
11:52:34 +stashbot │ T388379: ProbeDown - https://phabricator.wikimedia.org/T388379
11:52:46 tappof │ arturo: yes, I've seen
11:54:52 arturo │ tappof: are you able to investigate? I'm about to jump into a meeting
11:55:17 tappof │ arturo: yes, I'll check soon
11:55:22 arturo │ thanks
12:02:21 <-- │ dwalden ([email protected]) has quit (Quit: ZNC 1.8.2+deb2+b1 - https://znc.in)
12:02:37 --> │ dwalden [dwalden] (ZNC - https://znc.in) ([email protected]) has joined #wikimedia-cloud
12:07:25 --> │ PhantomTech [PhantomTech] (en:User:PhantomTech) (~PhantomTe@wikipedia/PhantomTech) has joined #wikimedia-cloud
12:34:05 -- │ tgr|away is now known as tgr_
13:06:28 tappof │ arturo: We didn't have the IPv6 check before... https://w.wiki/DNG7 AFAICS, Prometheus in eqiad can reach cloudgw2002-dev on the IPv6 VIP, but the reply gets lost somewhere https://snipboard.io/XJ0gn9.jpg
13:06:57 arturo │ topranks: oh, ok!
13:07:40 arturo │ we may want to create a ticket to investigate why that happens, and remove the IPv6 check meanwhile
13:10:05 tappof │ arturo: I think we can use T388379 as the task for this one. I'll link it to the task related to monitoring to keep track of the relationship.
13:10:06 +stashbot │ T388379: ProbeDown - https://phabricator.wikimedia.org/T388379
13:10:29 tappof │ is it ok for you arturo ?
13:11:26 arturo │ ok!
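For reference, a sketch of the kind of IPv6 ICMP probe module involved; the module name is hypothetical and the real definition lives in the Puppet-managed blackbox exporter configuration.

modules:
  icmp_ip6:                         # hypothetical module name
    prober: icmp
    timeout: 3s
    icmp:
      preferred_ip_protocol: ip6    # probe the IPv6 address (e.g. the cloudgw VIP)
      ip_protocol_fallback: false   # fail rather than silently falling back to IPv4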
Mar 7 2025
All the patches needed to achieve this goal can be found in task T387231: missing pdu infos for magru, which is the task that initiated this milestone.
Mar 6 2025
Mar 5 2025
Information about the model type was added to Netbox-Hiera through this PS: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1124142
Mar 4 2025
Part of the work needed to finalize task T387231: missing pdu infos for magru has been completed.
@Papaul No, I'm working on setting these objects only in Prometheus.
Feb 27 2025
I can confirm that it's a different model and responds to a different MIB (Raritan PDU2-MIB). I'll proceed with setting up the scraping and let you know once it's done.
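A sketch of what the snmp_exporter generator entry could look like; the module name is hypothetical, and the walked subtree assumes the PDU2-MIB objects sit under Raritan's enterprise OID (13742).

modules:
  raritan_pdu2:                     # hypothetical module name
    walk:
      - 1.3.6.1.4.1.13742.6         # assumed PDU2-MIB subtree under the Raritan enterprise OID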
Feb 26 2025
Thank you, @wiki_willy, for pointing me in the right direction within NetBox. It seems the PuppetQL query might need to be updated (different model and/or type?). I'll take a look tomorrow.
@wiki_willy The data is missing because Prometheus is not configured to retrieve metrics from magru's PDUs, as they are not present in NetBox. As soon as they are added to NetBox, they will be automatically recognized by Prometheus. Once done, let us know if you need further help setting up the dashboard.
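As a rough sketch of that flow (job name and file path are illustrative, not the actual layout): the targets file is regenerated from NetBox, so once the magru PDUs exist there they appear in the scrape configuration without further changes.

scrape_configs:
  - job_name: pdu_snmp_example                      # hypothetical job name
    file_sd_configs:
      - files:
          - /srv/prometheus/targets/pdus_*.yaml     # illustrative path, regenerated from NetBox data
        refresh_interval: 5m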