Wikipedia talk:Wikipedia Signpost/Single/2025-04-09
Comments
The following is an automatically-generated compilation of all talk pages for the Signpost issue dated 2025-04-09. For general Signpost discussion, see Wikipedia talk:Signpost.
Debriefing: Giraffer's RfA debriefing (855 bytes · 💬)
Thank you very much for this interesting report! More attention should be paid to such problems. --ssr (talk) 15:15, 12 April 2025 (UTC)
Congrats. I only came in with my neutral vote because I had a notification that linked me to your comments. My assumption was that because I got the link, that I was probably supposed to vote, and that's why I voted neutral. I genuinely haven't really cared about backend stuff, and I was like a man wandering in off the street into a meeting I didn't know anything about. Also, thank you for introducing me to the term "sockpuppet." And the term "RfA." --Guylaen (talk) 18:40, 17 April 2025 (UTC)
In focus: WMF to explore "common standards" for NPOV policies; implications for project autonomy remain unclear (13,789 bytes · 💬)
- It's possible that this NPOV project could have some utility for very small Wikipedias, but I can't see it generating anything useful for English Wikipedia, and if it's a stalking horse for imposition of WMF-created policies here, that would be very bad. Seems like a solution in search of a problem. Posted some thoughts here on the Meta talk page. —Ganesha811 (talk) 19:42, 9 April 2025 (UTC)
- I'm inclined to agree - and in general I don't see WMF joining the global shift toward consolidating authority as being a good sign.
- If they want to change our policies, NPOV is probably not even the first thing I would go for, there are a handful of other things we could fix. That said, any changes to policy should include heavy community feedback - we will only lose even more good editors if they aren't involved.ASUKITE 20:04, 9 April 2025 (UTC)
- The WMF should not have any say in the content of the Wikipedia projects. The basic independence of each Wikipedia project is what makes it stronger and resilient against influence and manipulation. Centralizing control into one institution means making Wikipedia much more vulnerable to capture and influence, and is also against the spirit of the project itself Ita140188 (talk) 08:21, 10 April 2025 (UTC)
- I'm fairly new to editing Wikipedia and tend to stick to small things like adding a source or whatever here and there, and I mostly agree with you but also I think the scale of Wikipedia may make it difficult to notice things that subtly go against the established policies.
- I won't claim I know all of the established policies here but I generally view it the same as I view laws in real life, which is for the most part the difference between right and wrong is intuitive and those times where it isn't you should be able to find the answer relatively easily or otherwise err on the side of you probably shouldn't do the thing you are questioning. I also tend to be pretty non-confrontational, especially in real life, but will not back down from a belief as long as it is grounded in logic or what is right and wrong - and I am not afraid to voice that.
- Point being, I think Wikipedia possibly has similar issues to a lot of other places, which is the sheer scale makes it easy to miss things and the complexity of established rules makes it so actually there are few rules except that those who know where to reference things and can argue their points well are the ones in charge and due to the complexity of the rules, those who are inexperienced or know they don't know all of the rules have a "chilling effect" of erring on the side of not doing the thing. The US legal system is the parallel example I am thinking of here, and actually, because this type of thing is seemingly happening in many places both in the physical and online world at the same time, and globally, they are all intersecting and making the problems worse elsewhere. Eg, since the laws in the real world are not being upheld evenly, people online are able to metaphorically "get away with murder" because if people aren't even enforcing blatant things in real life they aren't going to enforce things that aren't even technically illegal online, and the ones who "have the resources" referring to money and knowledge are able to do even worse.
- In just my short time editing on Wikipedia I have seen multiple examples of this.
- Earlier today on a talk page there is a back and forth about NPOV exactly where these issues of real life issues intersect with online problems on Wikipedia, as if it happened specifically so I could make this comment. I have also had issues in the past with the other example where those who know the rules of Wikipedia and how things work basically just run the show and have zero care about explaining things or any kind of working together with someone who may not agree with their points. I won't link to that because I'm not going to bother digging through multiple archived talk pages to find it but to put it simply I listed multiple issues with a page - as in, I didn't edit, but raised points on the talk page as it is advised to do - and a well established editor came in, archived the talk page, left some automated message about my complaints, then when I complained about them deleting my message on the talk page they... just did another automated message and removed my talk page comment again.
- Like I said, I am not afraid to speak up for what I believe, and I realize I may tend to "poke the bear" sometimes but I generally am pretty amicable and easy to get along with yet somehow I keep finding examples of people "in charge" being hostile to anyone raising issues with their decisions - both online and not. I keep finding my problem is I stand up for what I believe and don't take a load of nonsense because someone with a fancy title says they know the correct way to do things. So I mean, I am trying to write this in a way that doesn't raise too many issues, just like I do with issues IRL, and just like I do in basically every interaction in my entire life. I am at my core a peacemaker. Yet I am increasingly at the point where nobody else is saying anything and I know I am not the only one both seeing and experiencing major issues, again, both online and in real life. When the peacemakers are the ones causing problems, it seems likely there are large unresolved issues. Relevantusername2020 (talk) 09:42, 12 April 2025 (UTC)
- We have seen time and time again that volunteers have a strong track record of successfully managing neutrality on contentious subjects. That is of course nonsense, but if that's their position, what's the point of whatever it is they're doing? Coretheapple (talk) 20:22, 9 April 2025 (UTC)
- For people who speak more than one language, something that where I live means many people, Wikimedia is a very confusing place. They go to the supermarket and buy some pollastre, or pollo, or frango. And they get chicken, as they expect: different languages, same result. But not in Wikimedia. What is taken for granted in language A may be forbidden in language B, optional in language C or a restricted use version in language D. Two hundred and eighty times that! So while I wouldn't start with neutral point of view policies (oh boy!: I wrote the whole thing, like a regular human being) I think that setting common standards is a real need. B25es (talk) 05:30, 10 April 2025 (UTC)
- We like to think that we have the "neutrality" issue pretty well resolved on Wikipedia, and for larger projects (including English) this is mostly true. But there are some major challenges that other projects face. For example: non-neutral, biased, low quality, or even very limited reference sources in the specific language; lack of policy structure that supports neutrality; no actual neutrality policy, or only a copy/paste of a neutrality policy from another language; questionable image selection; community-supported political statements or representations (such as flags) in the base interface. Should new Wikipedia projects be required to have at minimum a policy statement about neutrality before being moved from the incubator? How do we ensure that the best available quality image is used in articles, Wikisource, and other places? On which Wikimedia projects is an element of bias acceptable? (I'm thinking maybe Wikivoyage, where editorial choices determine which hotels, tourist attractions, etc. are included.) Many people forget that the WMF Board of Trustees made a very powerful statement about expectations related to biographies of living people back in 2009, and expanded it to address media about living people in 2013; it had the effect of setting minimum standards for all projects to follow, without directly impacting the content decisions of those projects.
We like to think that we maintain a pretty unbiased (i.e., neutral) project as English Wikipedia. Realistically, we have a massive infrastructure that works toward that goal. Look at the decisions made by Arbcom over the last 20 years: most of them have to do (directly or indirectly) with ensuring that this encyclopedia's content remains neutral by penalizing or removing editors who fail to work within the constraints of this core policy. We have extensive lists of reliable sources, and an entire noticeboard that directly addresses this issue. No other Wikimedia project has this extensive support structure. Many other Wikimedia projects, and indeed the global Wikimedia movement, have made use of the hard work and policy development that has taken place here to use as a baseline for project-specific or global policies and procedures. In the big picture, creating a minimum standard (likely based on the existing principles and policies from large Wikipedias) is more likely to be helpful to smaller projects or those that have limited resources. There is a lot to think about here. Disclosure: I've been tapped to work on the "neutrality" question as part of a working group of the WMF Board. Hence why I've been thinking about this already. Risker (talk) 20:53, 10 April 2025 (UTC)
- The Wikivoyage policy is voy:en:WV:Be fair, and I think it works pretty well for them. The usual approach is that you try to make a positive recommendation ("Lively modern restaurant and bar in the middle of the city's nightlife scene" rather than "Noisy, ugly place where people get drunk") if you can, and only make negative recommendations for obvious attractions ("Famous Attraction is popular, but it's also expensive, hot, and inconvenient, so you might consider alternatives, such as..."). The key difference between Wikipedia and Wikivoyage is that Wikipedia bases neutrality on published reliable sources with a bit of word choice/neutral tone, and Wikivoyage bases it mostly on original research.
- I agree that there could be some value in having the Board officially support NPOV, but I wonder whether a different name might be helpful. "Neutral" lends itself to WP:GEVAL errors, as people think it means even-handedness. Perhaps "Resolution:Against unfair bias from editors"? A few uncomfortable examples might also help, like "It is not neutral for a Wikipedia editor to take sides in the Kashmir conflict by writing that Kashmir definitely belongs to either Pakistan or to India. Instead, Wikipedia editors should follow what the reliable sources say, which is that both have claimed Kashmir. Commons should host maps showing all versions, including India with its claimed territorial borders, Pakistan with its claimed territorial borders, and maps clearly identifying the contested territory as being contested".
- WhatamIdoing (talk) 23:28, 15 April 2025 (UTC)
- HaeB, your co-authors are probably too new to remember this, but I think that the 2009 foundation:Resolution:Biographies of living people should be credited as the first time the Board attempted to reform the communities' core content policies across all languages. The 2011 foundation:Resolution:Controversial content could also be considered another attempt. WhatamIdoing (talk) 22:53, 15 April 2025 (UTC)
- Not to defend my co-authors too much on this one, but you might have overlooked in the revision history that I had amended it to the current wording specifically with the 2009 BLP resolution in mind.
- As for the (largely failed) controversial content attempt, I agree it is interesting too in this context, but I don't think WP:NOTCENSORED is a core content policy.
- Regards, HaeB (talk) 01:00, 16 April 2025 (UTC)
- Fair enough. Wikipedia:Core content policies sticks to just three (WP:V, NOR, and NPOV), excluding both NOT and BLP. WhatamIdoing (talk) 01:12, 16 April 2025 (UTC)
NPOV Issues on Wikipedia in the News!
ICYMI Slate published an article a couple days ago about the debate on Wikipedia about the current casino fluctuations and how to define it - and also about what Wikipedia itself actually is. I don't have the answers but thought it was interesting enough to note! WP:NOTNP Relevantusername2020 (talk) 00:40, 15 April 2025 (UTC)
In the media: Indian judges demand removal of content critical of Asian News International (7,483 bytes · 💬)
- I'm confused which article is affected by the Delhi High Court's decision. It would seem to be Asian News International itself, but I don't see any recent office actions there. —Compassionate727 (T·C) 19:11, 9 April 2025 (UTC)
- There have been no office actions on the ANI-article yet. Gråbergs Gråa Sång (talk) 09:33, 10 April 2025 (UTC)
- I believe it's Asian News International vs. Wikimedia Foundation. Tenshi! (Talk page) 19:51, 9 April 2025 (UTC)
- That page does not appear to be saved in the IA, pre-censorship. Is there any mirror anyone knows, outside the single interwiki (to Chinese Wikipedia)? It's ironic that the article exists, right now, only in Chinese... (zh:亚洲国际新闻诉维基媒体基金会案). I'd have expected it would've been translated to other languages by now. PS. I found a copy here, seems to be from 16 October or so, as it mentions the take down order from that day. PPS. The Chinese article is superior, as it seems to be updated with post-take down content, up to March 2025 currently. --Piotr Konieczny aka Prokonsul Piotrus| reply here 00:28, 10 April 2025 (UTC)
- archive.today Gråbergs Gråa Sång (talk) 09:32, 10 April 2025 (UTC)
- Thanks, I need to start using this together with IA. Piotr Konieczny aka Prokonsul Piotrus| reply here 11:12, 10 April 2025 (UTC)
- It's easy to get confused on current WP-ANI media coverage. The "thing" about mean content in Asian News International is presumably still ongoing, the Delhi High Court recently told WMF "Do what ANI wants" or something like that. We're all quite eager to see what that leads to.
- And at the same time, WMF has been talking about the DHC-ordered blanking of Asian News International vs. Wikimedia Foundation in the Supreme Court of India, and that court seems to have doubts on if that order was reasonable. How many of us have started a WP-article that was mentioned in a supreme court? And blanked by court-order, that's probably more distinctive. Gråbergs Gråa Sång (talk) 11:28, 10 April 2025 (UTC)
The WMF blanked the article about the court case, upon the demand of the court, which felt that the Wikipedia article discussing the case could prejudice the case itself. Perhaps not an unreasonable request.
As for the case itself. The claimant presumably wants "disparaging" statements, such as this statement in the article lead, to be removed from the article. This is where issues of Freedom of the Press come into play.
Long-form investigations by The Caravan and The Ken have described ANI as being closely associated with the government of India for decades, including under Congress Party rule, but especially after the 2014 election of the Bharatiya Janata Party (BJP). In 2019, The Caravan reported that ANI "has a disturbing history of producing blatant propaganda for the state".
Of course, views can vary about whether a statement is true fact, or blatant propaganda. My statement here is my own. This edit is not an endorsement of the WMF. – wbm1058 (talk) 12:20, 15 April 2025 (UTC)
Portal Kombat
- The Portal Combat (Russian disinformation) links numbered 1,907, mostly from ru and uk domains. While this needs action, it is a very small number. All the best: Rich Farmbrough 20:49, 10 April 2025 (UTC).
- @Rich Farmbrough Portal:Combat? No... is your link correct? Piotr Konieczny aka Prokonsul Piotrus| reply here 00:17, 11 April 2025 (UTC)
- No, it's Portal Kombat. ☆ Bri (talk) 05:00, 11 April 2025 (UTC)
- Ty! All the best: Rich Farmbrough 16:36, 8 May 2025 (UTC).
Isn't the section on the Trump nominees, by attacking living Wikipedians, a violation of WP:BLP?
There's no reason to think the statements about the editors are neutral, accurate, or unbiased. We're quoting attacks on Wikipedians, in Signpost voice. WE CANNOT DO THIS. Adam Cuerden (talk)Has about 8.9% of all FPs. 19:02, 14 April 2025 (UTC)
- @Adam Cuerden: I do not see the violation that you do, and I am also not sure which text is problematic, but I think this is the fix you are requesting - special:diff/1284867896/1285617502. Can you check that, and tell me if that improves the clarity of the message and resolves the issue you found? Thanks, we aim to be clear in communicating to everyone. Bluerasberry (talk) 19:16, 14 April 2025 (UTC)
- I think so, but kind of needs a retraction next issue, since it was published like this. Adam Cuerden (talk)Has about 8.9% of all FPs. 10:15, 15 April 2025 (UTC)
Balkan spring
Wasn't that made into a draft, rather than being deleted? So it's still readable - and by everyone, not just admins. DS (talk) 02:43, 29 April 2025 (UTC)
- Looks to me like it was deleted by the admin on 2 April upon AfD closure, then restored to draft space by the same admin at an editor's request on 9 April, about a week after the AfD was done, and in fact about 30 minutes after we published the article you are reading. ☆ Bri (talk) 04:11, 29 April 2025 (UTC)
News and notes: 35,000 user accounts compromised, locked in attempted credential-stuffing attack (0 bytes · 💬)
Wikipedia talk:Wikipedia Signpost/2025-04-09/News and notes
News from Diff: Strengthening Wikipedia’s neutral point of view (0 bytes · 💬)
Wikipedia talk:Wikipedia Signpost/2025-04-09/News from Diff
Obituary: RHaworth, TomCat4680 and PawełMM (3,089 bytes · 💬)
- I only knew Paweł through years of Wikimedia collaboration, but it's devastating to hear of his passing. He was kind, reliable and always generous with his time. I'll miss him more than I ever expected. RIP. ‑‑Neveselbert (talk · contribs · email) 22:11, 9 April 2025 (UTC)
- Link for official RHaworth obituary does not work. Could someone remedy that? Softlavender (talk) 22:40, 9 April 2025 (UTC)
- It works for me, here in Australia. Could it be a geolocation issue? Graham87 (talk) 02:34, 10 April 2025 (UTC)
- Someone fixed the link before you checked it, within four hours after I requested that it be fixed. Softlavender (talk) 00:32, 11 April 2025 (UTC)
- Deepest condolences to their friends and families.–Vulcan❯❯❯Sphere! 06:49, 10 April 2025 (UTC)
- These obits are oddly devastating. Pawel helped me many times at the photo lab and was always so lovely. I will miss his presence among us. jengod (talk) 12:20, 10 April 2025 (UTC)
- Rest in peace to all of them. I'm particularly gutted over Tom's death, as I got to know him while helping him work on the article for the show After Midnight. As Taylor Tomlinson herself would say, "Good job, Tom, take care...". Oltrepier (talk) 19:55, 11 April 2025 (UTC)
Being born in 1941
Wow, very rarely do we see editors from the Silent Generation. RIP (to all of them). Some1 (talk) 23:08, 13 April 2025 (UTC)
- Roger was for a long time a regular at the London meetup. I remember him sitting there with his laptop, running queries to settle disputes. A sad loss. ϢereSpielChequers 08:18, 15 April 2025 (UTC)
- Was so sad to hear about Paweł, shall be dearly missed having been a stalwart graphist alongside myself in WP:GL/P for many a year! Thank you for your dedication all this time. Many condolences, RIP to all. Liandrei (talk) 21:44, 15 April 2025 (UTC)
- I never realised that this section existed until now... rest in peace to everyone here. --ISometimesEatBananas (talk) 01:19, 23 April 2025 (UTC)
Op-ed: How crawlers impact the operations of the Wikimedia projects (20,685 bytes · 💬)
- Regardless of what anyone thinks of AI, the burden scraper bots create for website operators is something that should not be ignored. --Firestar464 (talk) 18:42, 9 April 2025 (UTC)
Misbehaving scraper bots are definitely a problem, and I have a lot of respect for the hard work that the post's authors and the rest of Wikimedia Foundation's SRE team do to keep the sites up. But reading between the lines, this Diff post points to some of the Wikimedia Foundation's own failings which may well also be a major cause of these current problems, alongside irresponsible scraping behavior. In more detail (recapping various points from a discussion about this post last week in the Wikipedia Weekly Facebook group, where various WMF staff already weighed in):
- 1. Is there a legitimate alternative?
The Foundation had already highlighted this excessive AI-related scraping traffic several months ago (see the draft annual plan section linked at the end of the post - like probably many Wikimedians, I had read it there before and didn't have second thoughts about it). However, in this post we now learn that it is especially driven by demand for "the 144 million images, videos, and other files on Wikimedia Commons". What's interesting about this: For Wikipedia text (i.e. the content that has long been used to train non-multimodal LLMs like the original ChatGPT, Claude, Llama etc.), the "good citizen" advice has always been to download it in the form of the WMF's dumps, which puts much less strain on the infrastructure than sending millions of separate requests for individual web pages or hammering the APIs. But for those Commons media files which are apparently so in demand now, there has been no publicly available dump for over a decade. In other words, the "good citizen" method is not available for those who want to download a large dataset of current Commons media files.
This might also explain another aspect of this WMF blog post that is rather puzzling: It does not at all mention Wikimedia Enterprise (the paid API access offered by the Wikimedia Foundation's for-profit subsidiary Wikimedia LLC). Normally, this would seem to be the perfect opportunity to advertise Enterprise to AI companies who inconsiderately overload our infrastructure without giving back, and tell them to switch to the paid service instead. (Indeed, that's exactly what WMF/WM LLC representatives have done on previous occasions when publicly discussing the use and overuse of Wikipedia by AI companies, see e.g. our earlier Signpost coverage here.) - Not, however, if Wikimedia Enterprise has so far failed to address this apparently huge demand for Commons media files and neglected to build a paid API product for it. (Indeed I don't see such an offering on enterprise.wikimedia.com.)
- 2. If no approved alternative exists, and the Wikimedia Foundation works to disable existing methods of mass-downloading content, then it has effectively abandoned the right to fork.
The "Right To Fork" has been an important aspect of wikis since before Wikipedia was founded. It is part of the Wikimedia Foundation Guiding Principles:
we support the right of third parties to make and maintain licensing-compliant copies and forks of Wikimedia content and Wikimedia-developed code, regardless of motivation or purpose. While we are generally not able to individually assist such efforts, we enable them by making available copies of Wikimedia content in bulk, and avoiding critical dependencies on proprietary code or services for maintaining a largely functionally equivalent fork.
Note the regardless part - there is no exception like "unless it's for commercial gain" or "unless it's for AI training purposes" or "unless it might hurt the financial sustainability of the WMF or might reduce active editor numbers on wikipedia.org".
This "right to fork" is not exercised frequently, but it is very important as a governance safeguard. It enables anyone to launch a complete copy of (say) Wikipedia and/or Commons on a different website, if they think WMF has turned evil or is being taken over by the US government via executive order. For example, the community's ability to fork Wikipedia is very likely a major reason why Wikipedia does not contain ads today, see Signpost coverage: "Concerns about ads, US bias and Larry Sanger caused the 2002 Spanish fork". The WMF itself also mounted an aggressive legal defense of the right to fork Wikitravel and bring its entire content from a commercial host to Wikivoyage, see Signpost coverage: "Wikimedia Foundation declares 'victory' in Wikivoyage lawsuit".
And WMF's failure to "mak[e] available copies of Wikimedia content in bulk" in the case of Commons images has long been called out, see e.g. c:Commons:Requests_for_comment/Technical_needs_survey/Media_dumps and phab:T298394. (In fact, there didn't even exist an internal backup until a 2016 Community wishlist request was addressed years later.)
Of course (as was also pointed out by WMF staff in the aforementioned Facebook discussion last week), addressing this longstanding missing dumps issue requires some work (and addressing some organizational dysfunction). But the proposed annual plan focus area already requests a substantial amount of resources for working on adversarial solutions (better tracking of downloaders for enforcement purposes etc.). This may be understandable from the SRE team's narrow perspective. However, overall, such an all stick no carrot approach is inconsistent with our mission, and by the way also fails to address this question asked in the draft annual plan section itself: How might we funnel users into preferred, supported channels?
- 3. Developments after the Diff post
As mentioned, in last week's Facebook discussion some WMF staff already responded to some of these concerns. (By the way, I'm not sure that the WMF would publish this Diff post in the exact form again today, and I'd like to note that although I'm a member of the Signpost team, I was not involved in the decision to republish it as part of this Signpost issue.) For example, WMF's Jonathan Tweed stated:
Hi everyone, I am a Product Manager working on WE5 at the Wikimedia Foundation, primarily on the attribution work that’s described in WE5.1. [...] We are looking into providing responsible ways for people to obtain images from Commons. Whether this is through creating new dumps, which is a non-trivial amount of work, or rate limited access is still under discussion, but at no point is our intention to block all downloading of images.
I look forward to diving deeper on these questions with the technical community over the next few months, including at the Hackathon in May.
Also, yesterday Giuseppe (one of the Diff post's authors) announced updates to the WMF's "Robot policy". Among other changes, the policy now explicitly recommends considering dumps (as WMF has done elsewhere before):
Check if you could use our dumps or other forms of offline collection of our data instead of making live requests. If that’s a viable option for your use case, it will reduce the strain on our very limited resources and make your life easier.
Regarding media files (from upload.wikimedia.org), it asks to "Always keep a total concurrency of at most 2, and limit your total download speed to 25 Mbps (as measured over 10 second intervals)."
It might be interesting to calculate what this means for exercising the right to fork in practical (duration) terms.
Regards, HaeB (talk) 19:23, 9 April 2025 (UTC)
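As a rough aid for the calculation suggested above, here is a minimal sketch of the arithmetic. It assumes that "25 Mbps" means 25 × 10^6 bits per second and that the download runs continuously at that cap. The function name and the dataset sizes in the loop are hypothetical placeholders; the thread does not state the total size of the Commons media files.

```python
def transfer_days(total_bytes: float, mbps: float = 25.0) -> float:
    """Days needed to move total_bytes at a sustained rate of `mbps` megabits/s.

    Assumes decimal units (1 Mbps = 1_000_000 bits/s) and no pauses or overhead.
    """
    bytes_per_second = mbps * 1_000_000 / 8
    return total_bytes / bytes_per_second / 86_400  # 86,400 seconds per day


TB = 10**12  # decimal terabyte

# Hypothetical dataset sizes, for illustration only (not figures from the thread).
for size_tb in (1, 100, 500):
    print(f"{size_tb} TB: {transfer_days(size_tb * TB):.1f} days")
```

Under these assumptions each terabyte takes a bit under four days at the capped rate, so the practical duration scales linearly with whatever the real total size turns out to be.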
- PS (to add another vignette illustrating the organizational dysfunction described under 2. above, where important work falls through the cracks between different WMF teams' turfs):
- It turns out that someone from the WMF Research department foresaw this need back in 2023 already and was "working on releasing more datasets that can help AI practitioners to work on models that are relevant to Wikimedia's needs. Two of those datasets that are particularly exciting deal with image data, however, which is a major challenge in our public data sharing infrastructure."
- In April 2024, it was decided to "not prioritize this task [...] Currently, hosting large dataset has been a challenge in the foundation, and this task has highlighted the needs of being able to do so. Given this functionality is the prerequisite of completing this task, and the resource/effort caused by this overhead, [a Principal Software Engineer from the WMF's Data Platform Engineering team] will be helping us with this regard to with the goal of helping researchers to access dumps, as well as potentially helping other parts of the organization/Enterprise. ETA 12-18 month."
- But half a year later, this task was made dependent on the WMF first producing an AI strategy for 2026-28, a separate task that is currently marked as due on Feb 5 2025 but still open (and it wouldn't be surprising to see it take another year or so to complete, the same task had previously already been due on Sep 29 2023 and then on Apr 30 2024).
- Regards, HaeB (talk) 21:00, 9 April 2025 (UTC)
- @HaeB: Thanks for providing this info. There is obviously a lot of it, most being from sources I'm not familiar with. It's the usual problem of Wikipedia being so large and conversations being so spread out. I think I do a good job keeping up with the press views about en.Wiki and much of the usual places for discussion on en.Wiki, but phabricator and facebook are not in my usual rounds. If I had that info I probably wouldn't have submitted this for republication. Your comments do bring out that a major part of the problem is that Wikipedia Enterprise and the WMF haven't offered an alternative to the current inefficient method being used. If that is the ultimate takeaway from this, so be it. I'll have some simpler questions that you may have info on that will get me (and others like me) a bit more up to speed, in maybe 30 minutes. Smallbones(smalltalk) 21:44, 9 April 2025 (UTC)
- @HaeB: Sorry it took so long to get back. 1st basic question. The Commons data is so large, having been built up over 20 or so years, that percent turnover can't be very big at all. So are the same bots returning day after day? Why? Will they someday get nearly "filled up" and stop? Or is this rush going to just keep on going? A second question is about non-photo data and the low quality of some of the photos. If the bots are just looking for photos for AI are they scraping everything? Or can they choose what they want to scrape? Finally, if a major part of the problem is within the WMF, do you have any suggestions on how to fix that problem? Smallbones(smalltalk) 01:56, 10 April 2025 (UTC)
WE5.1 and WE5.2 there sound a lot like "the latest generation of clueless managers think API keys and replacing the Action API would work or be a good idea". 🙄 They never seem to listen to people who've been around longer that API keys can't work sensibly for web-based or open source applications, nor that getting rid of an API that handles 10,000 requests per second (literally) and powers tons of existing tools isn't a very feasible plan. Anomie⚔ 23:09, 9 April 2025 (UTC)
- While it isn't appropriate to block these bots for the reasons already outlined, I don't see any issue with a policy of rate-limiting bots which are putting a high demand on the infrastructure (with potential negative impacts on human readers and contributors). Then the bots get what they want, but it just takes somewhat longer. I also think it would be reasonable to limit the frequency of "whole of site" crawls in favour of getting the stream of updates after an initial whole-of-site crawl. And, as has been pointed out, there is a paid Enterprise service that is specifically designed for bots that want immediate access to recent changes etc. Kerry (talk) 01:20, 10 April 2025 (UTC)
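The rate-limiting idea in the comment above can be sketched with a token bucket. This is only an illustration, not a description of how WMF infrastructure works: the `TokenBucket` class, its parameters, and the explicit `now` timestamps are all hypothetical names chosen for the example.

```python
class TokenBucket:
    """Hypothetical per-client limiter: bursts up to `capacity` requests,
    refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = now           # time of the previous check, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = now - self.last
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the client has to wait; it is slowed, not blocked


bucket = TokenBucket(rate=1.0, capacity=2.0)
decisions = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0, 1.5, 2.0)]
```

A limiter of this kind matches the point made above: a heavy crawler still gets everything it asks for eventually, just more slowly.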
Hi @Smallbones: I am Birgit, one of the authors of the blog post and responsible for WE5. I really appreciate you submitting this for republication and giving us the opportunity to discuss it here. I hope you still see some value in that too!
I wanted to give some context on why this is an industry-wide problem, not something specific to Wikimedia Commons. Images are indeed a valuable resource for crawlers, but so is human-generated text. Scraping happens across the web and these users are likely to have an interest in using tools they can use for any site, rather than custom, Wikimedia-specific solutions.
Within our infrastructure, we observe scraping across wiki projects, and even on sites like Phabricator, Gitlab, Gerrit, or tools on Cloud Services. It’s the sum of bot traffic across all projects that makes up 65% of our most expensive traffic (as described in the blog post). We’re also observing scraping for content that is indeed provided through Enterprise’s services, or otherwise accessible through dumps.
I think HaeB is right that one way to address our developer and researcher communities’ needs specifically for Commons content could be to offer dumps, but that is not necessarily the case for companies across a growing industry.
This is a key reason why we need to explore different approaches and will require mechanisms to both encourage and enforce sustainable access. @Anomie: – just wanted to also clarify that this is the intent in 5.1 and 5.2, not retiring the very important Action API :-)
As Kerry mentions, we’re concerned about the bots which are putting a high demand on the infrastructure and causing traffic-related incidents that we have to deal with. We acknowledge that some users may require higher limits and will allow exceptions or refer these users to Enterprise as appropriate.
The intent of the Diff post was to make the problem clear, not provide all the answers. We’re hoping for input and support from the technical community as we learn and share more about this work over the coming months (for example at the Hackathon in early May). -BMueller (WMF) (talk) 19:36, 10 April 2025 (UTC)
- Thank you for that clarification. There have been too many people in the past who hadn't seemed to think beyond "it's not REST, so it's bad". 😀 Anomie⚔ 22:01, 10 April 2025 (UTC)
- Regarding the problem of cache misses for the articles that only the scrapers frequently seek, it is my understanding that ENWP Main Space amounts to only a few Gigabytes. So, can't the various continental and regional servers just cache them all? Jim.henderson (talk) 06:07, 11 April 2025 (UTC)
- @Jim.henderson: Thanks for the question! Unfortunately this would not solve the problem. Cache is not infinite, and we can’t predict which articles (or really: pages across namespaces, since crawlers tend to visit any URL) from which wiki any given bot will scrape. Any content we preemptively cache that is not used also deprives cache storage of content that may be more frequently requested by readers, which in turn would degrade their user experience. BMueller (WMF) (talk) 14:29, 11 April 2025 (UTC)
- @BMueller (WMF): is it possible to quantify how much it would cost for the cache to be made large enough to contain all rendered articles? — The Anome (talk) 13:37, 30 April 2025 (UTC)
quite a few users played a 1.5 hour long video of Carter's 1980 presidential debate with Ronald Reagan. This caused a surge in the network traffic
When someone watches Netflix the videos are chunked, and if you pause the video 5 seconds in you only get the first chunk. Seeking still works. Look at Dynamic Adaptive Streaming over HTTP and HTTP Live Streaming. If the WMF was using this tech then all those users would only get the chunks they actually watch. Downside is that it would probably require re-rendering the biggest videos. Polygnotus (talk) 11:25, 12 April 2025 (UTC)
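To put rough numbers on the chunking argument above, here is a minimal sketch. It assumes a fixed segment length of 10 seconds, which is an illustrative value and not something stated in the discussion; the function name `segments_fetched` is likewise hypothetical.

```python
import math


def segments_fetched(watched_seconds: float, segment_seconds: float,
                     total_seconds: float) -> tuple[int, int]:
    """Return (segments downloaded, segments in the whole video) for a viewer
    who watches from the start and stops after `watched_seconds`."""
    total = math.ceil(total_seconds / segment_seconds)
    # A segment is downloaded as soon as playback enters it.
    fetched = min(total, math.ceil(watched_seconds / segment_seconds))
    return fetched, total


# A 1.5-hour video (5400 s) paused 5 seconds in, with assumed 10 s segments.
fetched, total = segments_fetched(5, 10, 5400)
```

Under these assumptions the viewer who pauses early pulls one segment rather than the full file. Real players also buffer a few segments ahead, so the actual figure would be somewhat higher.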
- Coming up to speed on Wikimedia can be daunting, and for programmers trying to grab whatever they can from 100s of places on the Internet, they may follow the easy path so long as it reasonably works OK. Dumps, EventStream, Enterprise, etc. are more advanced topics. Why be nice? There are various ways with edge equipment to detect and shape/limit users pulling too much data. It's probably already done to combat DDoS. -- GreenC 02:35, 16 April 2025 (UTC)
- One concern I have about recent developments in web scraping is the danger of well-intentioned scrapers getting caught up in measures meant to mitigate against malicious ones. The Internet Archive's Wayback Machine, for example, makes heavy use of scrapers, and lately I've been noticing that attempting to use their "Save Page Now" service on Wikipedia pages hasn't been working as well as it used to. It would be a shame if this were a casualty of the fight against other scrapers. Cooljeanius (talk) (contribs) 02:58, 26 April 2025 (UTC)
- The bulk of the data in Commons is the media files, not the pages. Technically speaking, there is very little point in making compressed dumps of Commons content: almost all the files there are already pre-compressed (images, audio, video), or are small files like SVGs. tar archives would thus be ideal for the purpose of creating dumps, and would tend to be immutable over time with the exception of deletions or uploads of new revisions of existing files or pages, for which particular dump files could be regenerated if needed. Serving up huge numbers of multiple-gigabyte files is much more efficient than serving hundreds of times as many multi-megabyte files. And because they are just flat files, they can easily be stored in cooperating third-party file mirrors all over the world without any fancy technology setup.
So technically, this is not a difficult problem to solve. Practically and logistically, though, it's a massive one: it probably requires one or two more people to be hired to orchestrate it, lots more storage servers to be added to hold the data, and network engineering and ops to keep it all going. But that is one of the things the enormous amount of money the WMF has in its coffers should be there for. — The Anome (talk) 13:30, 30 April 2025 (UTC)
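The uncompressed-tar suggestion above can be sketched with Python's standard `tarfile` module. This is a toy illustration under stated assumptions: the helper names `pack_uncompressed_tar` and `list_members` and the sample file names are hypothetical, and an actual dump pipeline would stream from disk rather than hold data in memory.

```python
import io
import tarfile


def pack_uncompressed_tar(files: dict[str, bytes]) -> bytes:
    """Bundle already-compressed media files into one plain tar archive."""
    buf = io.BytesIO()
    # Mode "w" writes a tar without gzip/bz2/xz, since the media files
    # inside are already compressed.
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):  # stable order keeps rebuilds predictable
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_members(archive: bytes) -> list[str]:
    """Return the member names stored in an uncompressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        return tar.getnames()


# Hypothetical sample batch standing in for a slice of media files.
archive = pack_uncompressed_tar({"b.jpg": b"\xff\xd8", "a.svg": b"<svg/>"})
names = list_members(archive)
```

Grouping files into fixed batches like this would mean only the archives touched by a deletion or a new file revision need regenerating, which is the property the comment relies on.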
Opinion: Crawlers, hogs and gorillas (3,808 bytes · 💬)
This op-ed (shoehorned in after the Signpost's publication deadline) shows why we need better fact-checking of opinion pieces. E.g. regarding
There is an alternative for the corporations to obtain quick access to Wikipedia’s data while paying their fair share of the cost, WMF's own for-profit corporation Wikimedia Enterprise.
You would think so, but it is actually not true for the AI scraping requests that the WMF called out as particularly problematic in its Diff post (which conspicuously failed to mention Enterprise). See my notes here. Regards, HaeB (talk) 19:27, 9 April 2025 (UTC)
- And see my response there. Smallbones(smalltalk) 21:49, 9 April 2025 (UTC)
While they are at it, the WMF should impose fines when AI firms ignore the attribution requirement when using material scraped from WMF servers that’s licensed CC-BY-SA.
...and the ones that ignore the share-alike requirement, too, IMO! Cooljeanius (talk) (contribs) 03:03, 26 April 2025 (UTC)
- To the best of my knowledge, the copyright owners of the text (in most cases, the author) and their legal representatives are the only ones who can enforce the CC licensing. The WMF doesn't own the copyright to our contributions, so there's not much to be done, unfortunately. GreenLipstickLesbian💌🦋 03:07, 26 April 2025 (UTC)
- The WMF could fund such a court case on behalf of a group of volunteers from the community. My impression is that the WMF saw BY and SA as barriers to getting maximum use of the data, hence their support for Wikidata as a way to convert much of Wikipedia to CC0. Of course there are downsides to free reuse by commercial organisations, including unsourced "facts" in Wikipedia becoming sourced from sites that are unattributed mirrors of Wikipedia. If the WMF could detach itself from the silicon valley mindset, then enforcement of attribution and share alike would be an obvious thing to do. Not just to protect the project's ability to fundraise and the integrity of our sourcing, but also because attribution and the promise of Share Alike are important motivators to many in the community. ϢereSpielChequers 08:17, 26 April 2025 (UTC)
- Sounds like something like the FSF's copyright assignment process, or one of the SFC's class-action lawsuits, or something, could be useful here... Cooljeanius (talk) (contribs) 00:01, 1 May 2025 (UTC)
- Any idea how that would work without jeopardizing their safe harbour status? I'd imagine it would be hard to stand up in court and say "no, we're not responsible for double checking the copyright status of user-uploaded works. Yes, we are also suing people for copyright infringement on the behalf of those users". But, then again, copyright law man. It's weird. GreenLipstickLesbian💌🦋 06:24, 1 May 2025 (UTC)
Special report: Wikipedian and physician Ziyad al-Sufiani reportedly released from Saudi prison (2,927 bytes · 💬)
- A good reminder of what freedom really means, and what authoritarians are afraid of. No one should be afraid to edit Wikipedia. I'm glad Ziyad is free. —Ganesha811 (talk) 19:25, 9 April 2025 (UTC)
- @Ganesha811: Well said. Now let's keep praying for Osama, as well. Oltrepier (talk) 20:18, 10 April 2025 (UTC)
- The next step here must be political asylum if possible. No one should be going through this just because they criticize the shitty countries these people live in and the echo chambers they attempt to create. 🌙Eclipse (she/they/all neos • talk • edits) 23:03, 9 April 2025 (UTC)
- @LunaEclipse: It would certainly be good for Ziyad, but I don't really know which countries would step up to provide this kind of protection at the moment. Also, I don't want to set a dark tone, but -ish can hit the fan everywhere in the world, if people don't fight boldly enough for their rights... Oltrepier (talk) 20:25, 10 April 2025 (UTC)
- Thank you, Ziad and OsamaK, for your contributions to Wikipedia and your efforts to give everyone access to reliable information on important topics. — Newslinger talk 17:01, 12 April 2025 (UTC)
- Delighted to hear Ziyad has been released and I'm hoping Osama will be released soon enough as well. To echo Luna's comment, if seeking asylum is something Ziyad wants to do, I'd certainly hope the WMF would be willing to put some of their resources towards it. --Grnrchst (talk) 15:38, 19 April 2025 (UTC)
- As a wikipedian, I am very excited to hear this news. Thanks Allah. Md Mobashir Hossain (talk) 05:14, 21 April 2025 (UTC)
- How did the Saudi government get the names of Osama Khalid and Ziyad al-Sufiani? I notice that their user names are close to their real names. I hope other Saudi editors living in Saudi Arabia are able to edit safely. What does the Wikimedia Foundation (WMF) say about their safety? Does the WMF recommend that in-country Saudi editors stop editing? Maybe they should. And then erase all trace of their username on Wikimedia. I can think of many ways that police and security agencies can figure out who an editor is. --Timeshifter (talk) 08:09, 21 April 2025 (UTC)
Traffic report: Heigh-Ho, Heigh-Ho, off to report we go... (390 bytes · 💬)
- That's an interesting choice for the photo depicting Minecraft. Reconrabbit 19:20, 10 April 2025 (UTC)