
Thematic link boxes throw off importance computation #85

Open
lonvia opened this issue Dec 14, 2024 · 8 comments

Comments

@lonvia
Member

lonvia commented Dec 14, 2024

Wikipedia articles feature more and more of these "related features" link boxes. Example: https://en.wikipedia.org/wiki/Cortland_County,_New_York has a whole collection of links to all kinds of topics related to the state of New York.

These boxes throw off importance computation because they add backlinks without them being topic-wise really connected to the article.

Any chance we can filter them out?

(Issue originally reported: https://community.openstreetmap.org/t/coordinates-for-cortland-ny-are-actually-for-mcgraw-ny/122894)

@mtmail
Contributor

mtmail commented Jan 10, 2025

It doesn't seem possible to detect whether a page contains such a template, let alone how many links those templates contribute.

Cortland_County,_New_York is namespace=0, page_id=54164, and its info page lists "Transcluded templates (177)".

https://www.mediawiki.org/wiki/Manual:Templatelinks_table

zcat enwiki-20241201-templatelinks.sql.gz | mysqldump_to_csv.py | gzip -9 > templatelinks.csv.gz
zgrep -c ^54164, templatelinks.csv.gz
178
zgrep ^54164, templatelinks.csv.gz
54164,0,7
54164,0,14
54164,0,48
54164,0,49
54164,0,80
54164,0,81
54164,0,82
54164,0,87
54164,0,88
54164,0,89
54164,0,90
[...]

Neither the template for the county
https://en.wikipedia.org/w/index.php?title=Template:Cortland_County,_New_York&action=info
Namespace ID 10, Page ID 12196125
nor the state
https://en.wikipedia.org/w/index.php?title=Template:New_York_(state)&action=info
Namespace ID 10, Page ID 599554
is listed for page 54164 in the templatelinks file.
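
One possible explanation (an assumption, not verified): in the current MediaWiki schema the third column of templatelinks is tl_target_id, which points into the separate linktarget table rather than holding a template page id, so comparing it against the page ids above would never match. A minimal sketch of resolving the transcluded templates by joining the two dumps, assuming both have been converted to CSV with mysqldump_to_csv.py as above (file names are placeholders):

import csv
import gzip

# Placeholder file names; both tables converted from the SQL dumps
# with mysqldump_to_csv.py as shown above.
LINKTARGET_CSV = 'linktarget.csv.gz'        # lt_id, lt_namespace, lt_title
TEMPLATELINKS_CSV = 'templatelinks.csv.gz'  # tl_from, tl_from_namespace, tl_target_id
PAGE_ID = '54164'                           # Cortland_County,_New_York

# Map linktarget id -> template title (namespace 10 = Template:).
template_titles = {}
with gzip.open(LINKTARGET_CSV, 'rt', newline='') as f:
    for lt_id, lt_ns, lt_title in csv.reader(f):
        if lt_ns == '10':
            template_titles[lt_id] = lt_title

# List the templates transcluded by the page.
with gzip.open(TEMPLATELINKS_CSV, 'rt', newline='') as f:
    for tl_from, tl_from_ns, tl_target_id in csv.reader(f):
        if tl_from == PAGE_ID:
            print(tl_target_id, template_titles.get(tl_target_id, '<not a template>'))

If that resolves the names, navbox-style templates could be identified by title and their link contributions subtracted before counting backlinks.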

@1ec5

1ec5 commented Jan 10, 2025

Would it be possible to factor in backlinks to the associated Wikidata item? Coverage may be sparser in some places, but it would probably be less affected by noise from issues like this.
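
For a quick spot check of that idea, backlinks to a single item can be counted through the standard MediaWiki API (list=backlinks) on wikidata.org. A rough sketch, far too slow to run for every place:

import requests

def wikidata_backlink_count(qid, limit=500):
    # Count pages on wikidata.org linking to the item (capped at 'limit').
    params = {
        'action': 'query',
        'list': 'backlinks',
        'bltitle': qid,
        'bllimit': limit,
        'format': 'json',
    }
    r = requests.get('https://www.wikidata.org/w/api.php', params=params, timeout=30)
    r.raise_for_status()
    return len(r.json()['query']['backlinks'])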

@mtmail
Contributor

mtmail commented Jan 11, 2025

Possibly. Wikidata collects statements about its items. One place having more statements than another might indicate it's more important or popular (or just has more data coverage or wiki editor interest).

https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer
https://www.wikidata.org/wiki/Wikidata:Database_download

A full dump seems to be 130GB (I'm guessing a 90% compression rate since it's JSON text) and contains about 1.5 billion item statements. It's a huge project requiring patience (waiting for long-running data processing to finish). Maybe we should suggest it as a Google Summer of Code project.

Currently we fetch some item lists via a Wikidata API. That would be too slow to do for all places.
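
Counting statements without a database import could also work by streaming the dump. A minimal sketch, assuming the usual one-entity-per-line layout of latest-all.json.gz:

import gzip
import json

# Stream the dump and keep only a statement count per item in memory.
statement_counts = {}
with gzip.open('latest-all.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        line = line.strip().rstrip(',')
        if line in ('[', ']', ''):
            continue  # the dump is one large JSON array, one entity per line
        entity = json.loads(line)
        if entity.get('type') != 'item':
            continue
        claims = entity.get('claims', {})
        statement_counts[entity['id']] = sum(len(v) for v in claims.values())

Decompressing and parsing ~1.5 billion statements is the long-running part; the counting itself is trivial. For all ~100 million items the dict itself gets large, so writing counts out incrementally would be more practical.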

@1ec5

1ec5 commented Jan 11, 2025

This is pretty much the entire purpose of QRank: #10. But it ranks by page views instead of by page rank, hence the caveat emptor in #10 (comment).

@1ec5

1ec5 commented Jan 12, 2025

Wikipedia’s search engine has its own low-level API, as well as JSON dumps of the index. I wonder if it would be feasible to piggyback on whatever Wikipedia is doing to boost or penalize certain articles, which can be more nuanced than pure backlink counting.

@lonvia
Member Author

lonvia commented Jan 12, 2025

Those CirrusSearch dumps definitely look interesting. We'd need to figure out how to translate this into our importance scale, but that should be doable.
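
As a rough illustration of what that translation could look like (not a worked-out formula): the cirrus "content" dumps are in Elasticsearch bulk format, alternating an index-action line with the page document, and the documents carry fields such as incoming_links and popularity_score. Treat the exact layout and the file name below as assumptions:

import gzip
import json
import math

link_counts = {}
with gzip.open('enwiki-20241201-cirrussearch-content.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        doc = json.loads(line)
        if set(doc) == {'index'}:
            continue  # skip the bulk-action line, keep the page document
        link_counts[doc.get('title')] = doc.get('incoming_links', 0) or 0

# Illustrative normalization to a 0..1 importance value: log-scale against
# the best-linked article in the dump.
max_links = max(link_counts.values()) or 1
importance = {t: math.log(1 + n) / math.log(1 + max_links)
              for t, n in link_counts.items()}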

@ImreSamu

@mtmail:

A full dump seems to be 130GB (I'm guessing 90% compression rate since it's JSON text).

(referring to the Wikidata JSON dump)
If there's no better solution, I have a partial process for importing a Wikidata JSON dump into Postgres/PostGIS, and I have some experience with it as well. However, I probably won't have much time to work on this in the next 1-2 months.

An ideal solution might be to have a third-party service pre-filter the Wikidata JSON dump to only include the IDs referenced in OSM or needed by Nominatim.
This way, we would only need to download and import the filtered data into Postgres.
I’m thinking about it. 🤔
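
Even without a third-party service, the pre-filtering step itself is straightforward to sketch, assuming a pre-built list of wanted Q-ids (one per line; 'needed_qids.txt' and the other file names are placeholders, and one way to build that list from a planet file is sketched further down):

import gzip
import json

# Q-ids referenced from OSM / needed by Nominatim, one per line (placeholder name).
with open('needed_qids.txt') as f:
    wanted = {line.strip() for line in f if line.strip()}

with gzip.open('latest-all.json.gz', 'rt', encoding='utf-8') as src, \
     gzip.open('wikidata-filtered.json.gz', 'wt', encoding='utf-8') as dst:
    for line in src:
        stripped = line.strip().rstrip(',')
        if stripped in ('[', ']', ''):
            continue
        if json.loads(stripped).get('id') in wanted:
            dst.write(stripped + '\n')

Parsing every line just to read the id is the slow part; a cheap substring check on the raw line before json.loads would cut most of that cost.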

@mtmail
Contributor

mtmail commented Jan 13, 2025

My Nominatim database contains 2,548,027 unique wikidata ids. But I'd rather create the list from a raw planet.pbf dump file, because other Nominatim admins might import different or more places (for example, I'm not interested in election borders or some maritime borders).

The Wikidata dump took 10 hours to download. Slow, but that would be OK to do monthly.
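
Extracting that id list from a planet file could look roughly like this (a sketch using pyosmium; the output file name matches the placeholder in the filtering sketch above, and a full planet pass takes a while):

import osmium  # pyosmium

class WikidataCollector(osmium.SimpleHandler):
    """Collect every wikidata=* tag value seen on nodes, ways and relations."""

    def __init__(self):
        super().__init__()
        self.qids = set()

    def _collect(self, obj):
        value = obj.tags.get('wikidata')
        if value:
            # A single tag may hold several ids separated by semicolons.
            self.qids.update(q.strip() for q in value.split(';'))

    def node(self, n):
        self._collect(n)

    def way(self, w):
        self._collect(w)

    def relation(self, r):
        self._collect(r)

handler = WikidataCollector()
handler.apply_file('planet-latest.osm.pbf')
with open('needed_qids.txt', 'w') as f:
    f.write('\n'.join(sorted(handler.qids)) + '\n')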
