Thematic link boxes throw off importance computation #85
It doesn't seem possible to detect whether a page contains such a template, let alone how many links those templates contain. Cortland_County,_New_York is namespace=0, page_id=54164 and contains "Transcluded templates (177)". https://www.mediawiki.org/wiki/Manual:Templatelinks_table
Neither the template for the county
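As a starting point, the per-page transclusion count mentioned above (the "Transcluded templates (177)") can at least be read out of the templatelinks table. This is a minimal sketch, not tested against a live replica; it assumes the older templatelinks schema with tl_from / tl_title (newer MediaWiki versions replace tl_title with tl_target_id referencing a linktarget table):

```python
# Hypothetical query against a MediaWiki database replica; the schema
# columns (tl_from, tl_title) are an assumption based on the older
# templatelinks layout documented in the MediaWiki manual.
COUNT_TRANSCLUSIONS = """
SELECT tl_title, COUNT(*) AS uses
FROM templatelinks
WHERE tl_from = %(page_id)s  -- e.g. 54164 for Cortland_County,_New_York
GROUP BY tl_title
ORDER BY uses DESC
"""
```

This only tells us which templates a page transcludes, not which of them are "thematic link boxes", so some name-based filtering (e.g. navbox-style template names) would still be needed on top.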
Would it be possible to factor in backlinks to the associated Wikidata item? Coverage may be sparser in some places, but it would probably be less noisy from issues like this.
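For what it's worth, item backlinks ("What links here" on wikidata.org) are reachable through the standard MediaWiki API's list=backlinks module. A minimal sketch of the request shape; a real importance signal would have to page through blcontinue and count the results, and the helper name here is just illustrative:

```python
from urllib.parse import urlencode

def backlink_query_url(qid: str, limit: int = 500) -> str:
    """Build a MediaWiki API query listing pages that link to a
    Wikidata item (its backlinks). Illustrative helper, not an
    existing Nominatim function."""
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": qid,
        "blnamespace": 0,   # restrict to the item namespace
        "bllimit": limit,
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)

print(backlink_query_url("Q42"))
```

Doing this per item over the API would hit the same speed problem mentioned below for item lists, so it probably only makes sense in bulk from a dump.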
Possibly. Wikidata collects statements about its items. One place having more statements than another might indicate it's more important or popular (or just has more data coverage or wiki editor interest). https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer A full dump seems to be 130GB (I'm guessing a 90% compression rate, since it's JSON text) and holds 1.5 billion item statements. It's a huge project requiring patience (waiting for long-running data processing to finish). Maybe we suggest it as a Google Summer of Code project. Currently we fetch some item lists via the Wikidata API; that would be too slow to do for all places.
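The per-item statement count itself is cheap to extract while streaming the dump, since the JSON dump puts one entity per line inside a top-level array and keeps statements under the "claims" key. A minimal sketch, assuming that line-per-entity layout:

```python
import json

def statement_count(line: str):
    """Parse one line of the Wikidata JSON dump and return
    (item id, number of statements), or None for the array
    brackets and non-item entities."""
    line = line.strip().rstrip(",")
    if not line or line in ("[", "]"):
        return None
    entity = json.loads(line)
    if entity.get("type") != "item":
        return None
    # "claims" maps property ids (P31, ...) to lists of statements
    n = sum(len(stmts) for stmts in entity.get("claims", {}).values())
    return entity["id"], n

# tiny fabricated entity standing in for a real dump line
sample = '{"type": "item", "id": "Q42", "claims": {"P31": [{}, {}], "P19": [{}]}},'
print(statement_count(sample))  # ('Q42', 3)
```

Counting statements this way avoids building the full entities in memory, so the 130GB dump is mostly a disk/time problem rather than a RAM problem.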
This is pretty much the entire purpose of QRank: #10. But it ranks by page view instead of page rank, hence the caveat emptor in #10 (comment).
Wikipedia’s search engine has its own low-level API, as well as JSON dumps of the index. I wonder if it would be feasible to piggyback on whatever Wikipedia is doing to boost or penalize certain articles, which can be more nuanced than pure backlink counting. |
Those cirrus-search dumps definitely look interesting. We'd need to figure out how to translate this into our importance scale, but that should be doable.
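The translation could be as simple as log-normalising whatever raw score the search index exposes onto the 0..1 range. A minimal sketch; the log scaling is an assumption to keep a few huge outliers from flattening everything else, not Nominatim's actual importance formula:

```python
import math

def to_importance(raw_score: float, max_score: float) -> float:
    """Map a raw ranking signal (e.g. a score from a cirrus-search
    dump) onto a 0..1 importance scale by log-normalising against
    the maximum observed score. Illustrative, not Nominatim's
    existing formula."""
    if raw_score <= 0 or max_score <= 1:
        return 0.0
    return min(1.0, math.log(1 + raw_score) / math.log(1 + max_score))
```

Whatever mapping is chosen, it mainly needs to preserve the ordering of places and keep the distribution comparable to the current backlink-derived scores.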
@mtmail: (Wikidata JSON dump) An ideal solution might be to have a third-party service pre-filter the Wikidata JSON dump to only include the IDs referenced in OSM or needed by Nominatim.
My Nominatim database contains 2,548,027 unique Wikidata IDs. But I'd try to create the list from a raw planet.pbf dump file, because other Nominatim admins might import different or more places (for example, I'm not interested in election borders or some maritime borders). The Wikidata dump took 10 hours to download. Slow, but that would be OK to do monthly.
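Given a set of wanted QIDs (from the database or from planet.pbf), the pre-filtering step is a single streaming pass over the dump. A minimal sketch, assuming the gzipped line-per-entity dump layout; file names and the source of wanted_qids are illustrative:

```python
import gzip
import json

def filter_dump(dump_path: str, wanted_qids: set, out_path: str) -> int:
    """Stream a gzipped Wikidata JSON dump and keep only entities
    whose ID is in wanted_qids (e.g. the ~2.5M wikidata tags found
    in OSM). One entity per line, so the 130GB dump is never held
    in memory. Returns the number of entities kept."""
    kept = 0
    with gzip.open(dump_path, "rt", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            entity = json.loads(line)
            if entity.get("id") in wanted_qids:
                dst.write(json.dumps(entity) + "\n")
                kept += 1
    return kept
```

A cheap substring check for the QID before the full json.loads would speed this up considerably, at the cost of depending on the exact serialisation of the dump.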
Wikipedia articles feature more and more of these "related features" link boxes. Example: https://en.wikipedia.org/wiki/Cortland_County,_New_York has a whole collection of links to all kinds of topics related to the state of New York.
These boxes throw off the importance computation because they add backlinks that aren't really topically connected to the article.
Any chance we can filter them out?
(Issue originally reported: https://community.openstreetmap.org/t/coordinates-for-cortland-ny-are-actually-for-mcgraw-ny/122894)