
Thematic link boxes throw off importance computation #85

Open
lonvia opened this issue Dec 14, 2024 · 8 comments

Comments

@lonvia
Member

lonvia commented Dec 14, 2024

Wikipedia articles feature more and more of these "related features" link boxes. Example: https://en.wikipedia.org/wiki/Cortland_County,_New_York has a whole collection of links to all kinds of topics related to the state of New York.

These boxes throw off importance computation because they add backlinks without them being topic-wise really connected to the article.

Any chance we can filter them out?

(Issue originally reported: https://community.openstreetmap.org/t/coordinates-for-cortland-ny-are-actually-for-mcgraw-ny/122894)

@mtmail
Contributor

mtmail commented Jan 10, 2025

It doesn't seem possible to detect whether a page contains such a template, let alone how many links those templates contribute.

Cortland_County,_New_York is namespace=0, page_id=54164, and its info page lists "Transcluded templates (177)".

https://www.mediawiki.org/wiki/Manual:Templatelinks_table

zcat enwiki-20241201-templatelinks.sql.gz | mysqldump_to_csv.py | gzip -9 > templatelinks.csv.gz
zgrep -c ^54164, templatelinks.csv.gz
178
zgrep ^54164, templatelinks.csv.gz
54164,0,7
54164,0,14
54164,0,48
54164,0,49
54164,0,80
54164,0,81
54164,0,82
54164,0,87
54164,0,88
54164,0,89
54164,0,90
[...]

Neither the template for the county
https://en.wikipedia.org/w/index.php?title=Template:Cortland_County,_New_York&action=info
Namespace ID 10, Page ID 12196125
nor the state
https://en.wikipedia.org/w/index.php?title=Template:New_York_(state)&action=info
Namespace ID 10, Page ID 599554
is listed for page 54164 in the templatelinks file.
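
One possible explanation (an assumption, not verified): in the current MediaWiki schema the third column of templatelinks is tl_target_id, which points into the separate linktarget table rather than holding a template page id, so comparing it against the page ids above would never match. A minimal sketch of resolving the transcluded templates by joining the two dumps, assuming both have been converted to CSV with mysqldump_to_csv.py as above (file names are placeholders):

import csv
import gzip

# Placeholder file names; both tables converted from the SQL dumps
# with mysqldump_to_csv.py as shown above.
LINKTARGET_CSV = 'linktarget.csv.gz'        # lt_id, lt_namespace, lt_title
TEMPLATELINKS_CSV = 'templatelinks.csv.gz'  # tl_from, tl_from_namespace, tl_target_id
PAGE_ID = '54164'                           # Cortland_County,_New_York

# Map linktarget id -> template title (namespace 10 = Template:).
template_titles = {}
with gzip.open(LINKTARGET_CSV, 'rt', newline='') as f:
    for lt_id, lt_ns, lt_title in csv.reader(f):
        if lt_ns == '10':
            template_titles[lt_id] = lt_title

# List the templates transcluded by the page.
with gzip.open(TEMPLATELINKS_CSV, 'rt', newline='') as f:
    for tl_from, tl_from_ns, tl_target_id in csv.reader(f):
        if tl_from == PAGE_ID:
            print(tl_target_id, template_titles.get(tl_target_id, '<not a template>'))

If that resolves the names, navbox-style templates could be identified by title and their link contributions subtracted before counting backlinks.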

@1ec5

1ec5 commented Jan 10, 2025

Would it be possible to factor in backlinks to the associated Wikidata item? Coverage may be sparser in some places, but it would probably be less affected by noise from issues like this.
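
For a quick spot check of that idea, backlinks to a single item can be counted through the standard MediaWiki API (list=backlinks) on wikidata.org. A rough sketch, far too slow to run for every place:

import requests

def wikidata_backlink_count(qid, limit=500):
    # Count pages on wikidata.org linking to the item (capped at 'limit').
    params = {
        'action': 'query',
        'list': 'backlinks',
        'bltitle': qid,
        'bllimit': limit,
        'format': 'json',
    }
    r = requests.get('https://www.wikidata.org/w/api.php', params=params, timeout=30)
    r.raise_for_status()
    return len(r.json()['query']['backlinks'])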

@mtmail
Contributor

mtmail commented Jan 11, 2025

Possibly. Wikidata collects statements about its items. One place having more statements than another might indicate it's more important or popular (or just has more data coverage or wiki editor interest).

https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer
https://www.wikidata.org/wiki/Wikidata:Database_download

A full dump seems to be 130GB (I'm guessing a 90% compression rate since it's JSON text) and contains about 1.5 billion item statements. It's a huge project requiring patience (waiting for long-running data processing to finish). Maybe we should suggest it as a Google Summer of Code project.

Currently we fetch some item lists via a Wikidata API. That would be too slow to do for all places.
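
Counting statements without a database import could also work by streaming the dump. A minimal sketch, assuming the usual one-entity-per-line layout of latest-all.json.gz:

import gzip
import json

# Stream the dump and keep only a statement count per item in memory.
statement_counts = {}
with gzip.open('latest-all.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        line = line.strip().rstrip(',')
        if line in ('[', ']', ''):
            continue  # the dump is one large JSON array, one entity per line
        entity = json.loads(line)
        if entity.get('type') != 'item':
            continue
        claims = entity.get('claims', {})
        statement_counts[entity['id']] = sum(len(v) for v in claims.values())

Decompressing and parsing ~1.5 billion statements is the long-running part; the counting itself is trivial. For all ~100 million items the dict itself gets large, so writing counts out incrementally would be more practical.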

@1ec5

1ec5 commented Jan 11, 2025

This is pretty much the entire purpose of QRank: #10. But it ranks by page views instead of by page rank, hence the caveat emptor in #10 (comment).

@1ec5

1ec5 commented Jan 12, 2025

Wikipedia’s search engine has its own low-level API, as well as JSON dumps of the index. I wonder if it would be feasible to piggyback on whatever Wikipedia is doing to boost or penalize certain articles, which can be more nuanced than pure backlink counting.

@lonvia
Member Author

lonvia commented Jan 12, 2025

Those CirrusSearch dumps definitely look interesting. We'd need to figure out how to translate this into our importance scale, but that should be doable.
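
As a rough illustration of what that translation could look like (not a worked-out formula): the cirrus "content" dumps are in Elasticsearch bulk format, alternating an index-action line with the page document, and the documents carry fields such as incoming_links and popularity_score. Treat the exact layout and the file name below as assumptions:

import gzip
import json
import math

link_counts = {}
with gzip.open('enwiki-20241201-cirrussearch-content.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        doc = json.loads(line)
        if set(doc) == {'index'}:
            continue  # skip the bulk-action line, keep the page document
        link_counts[doc.get('title')] = doc.get('incoming_links', 0) or 0

# Illustrative normalization to a 0..1 importance value: log-scale against
# the best-linked article in the dump.
max_links = max(link_counts.values()) or 1
importance = {t: math.log(1 + n) / math.log(1 + max_links)
              for t, n in link_counts.items()}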

@ImreSamu

@mtmail:

A full dump seems to be 130GB (I'm guessing 90% compression rate since it's JSON text).

(referring to the Wikidata JSON dump)
If there's no better solution, I have a partial process for importing a Wikidata JSON dump into Postgres/PostGIS, and I have some experience with it as well. However, I probably won't have much time to work on this in the next 1-2 months.

An ideal solution might be to have a third-party service pre-filter the Wikidata JSON dump to only include the IDs referenced in OSM or needed by Nominatim.
This way, we would only need to download and import the filtered data into Postgres.
I’m thinking about it. 🤔
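
Even without a third-party service, the pre-filtering step itself is straightforward to sketch, assuming a pre-built list of wanted Q-ids (one per line; 'needed_qids.txt' and the other file names are placeholders, and one way to build that list from a planet file is sketched further down):

import gzip
import json

# Q-ids referenced from OSM / needed by Nominatim, one per line (placeholder name).
with open('needed_qids.txt') as f:
    wanted = {line.strip() for line in f if line.strip()}

with gzip.open('latest-all.json.gz', 'rt', encoding='utf-8') as src, \
     gzip.open('wikidata-filtered.json.gz', 'wt', encoding='utf-8') as dst:
    for line in src:
        stripped = line.strip().rstrip(',')
        if stripped in ('[', ']', ''):
            continue
        if json.loads(stripped).get('id') in wanted:
            dst.write(stripped + '\n')

Parsing every line just to read the id is the slow part; a cheap substring check on the raw line before json.loads would cut most of that cost.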

@mtmail
Contributor

mtmail commented Jan 13, 2025

My Nominatim database contains 2,548,027 unique wikidata ids. But I'd rather create the list from a raw planet.pbf dump file, because other Nominatim admins might import different or more places (for example, I'm not interested in election borders or some maritime borders).

The Wikidata dump took 10 hours to download. Slow, but that would be OK to do monthly.
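
Extracting that id list from a planet file could look roughly like this (a sketch using pyosmium; the output file name matches the placeholder in the filtering sketch above, and a full planet pass takes a while):

import osmium  # pyosmium

class WikidataCollector(osmium.SimpleHandler):
    """Collect every wikidata=* tag value seen on nodes, ways and relations."""

    def __init__(self):
        super().__init__()
        self.qids = set()

    def _collect(self, obj):
        value = obj.tags.get('wikidata')
        if value:
            # A single tag may hold several ids separated by semicolons.
            self.qids.update(q.strip() for q in value.split(';'))

    def node(self, n):
        self._collect(n)

    def way(self, w):
        self._collect(w)

    def relation(self, r):
        self._collect(r)

handler = WikidataCollector()
handler.apply_file('planet-latest.osm.pbf')
with open('needed_qids.txt', 'w') as f:
    f.write('\n'.join(sorted(handler.qids)) + '\n')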
