Wikipedia's category hierarchy forms a graph. It's definitely cyclic (Category:Ethology belongs to Category:Behavior, which in turn belongs to Category:Ethology).
At any rate, did you know that "Chicago Stags coaches" is a subcategory of "Natural sciences"? If you don't believe me, go to the Wikipedia entry for the Natural sciences category, and expand the following list of subcategories:
- Subfields of zoology
- Human behavior
- Ball games
- Basketball teams
- Defunct basketball teams
- Defunct National Basketball Association teams
- Chicago Stags
- Chicago Stags coaches
So if you're trying to actually find pages about Natural sciences, maybe it's enough to limit the depth of your breadth-first search down the graph.
This is sort of reasonable, and things up to and including depth four are quite reasonable, including topics like "Neurochemistry", "Planktology" and "Chemical elements". There are a few outliers, like "Earth observation satellites of Israel", which you could certainly make a case is not really natural science.
At depth five, things become much more mixed. On the one hand, you get categories you might like to include, like "Statins", "Hematology", "Lagoons" and "Satellites" (interesting that Satellites is actually deeper than the Israel thing). But you also get a fair number of weird things, like "Animals in popular culture" and "Human body positions". It's still not 50/50, but it's getting murky.
At depth six, based on my quick perusal, it's about 50/50.
And although I haven't tried it, I suspect that if you use a starting point other than Natural sciences, the depth at which things get weird is going to be very different.
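For concreteness, here is a minimal sketch of the depth-limited breadth-first search described above. It assumes you've already loaded the category graph into a dict mapping each category to its subcategories; the function name and the toy edges are made up for illustration and are not the real Wikipedia structure.

```python
from collections import deque

def subcategories_within_depth(children, root, max_depth):
    """Return {category: depth} for everything reachable from root
    in at most max_depth steps down the category graph."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        if depth[cat] == max_depth:
            continue  # do not expand past the depth limit
        for child in children.get(cat, ()):
            if child not in depth:  # the visited check also protects against cycles
                depth[child] = depth[cat] + 1
                queue.append(child)
    return depth

# Toy graph (hypothetical edges), including an Ethology <-> Behavior cycle.
toy = {
    "Natural sciences": ["Subfields of zoology", "Physical sciences"],
    "Subfields of zoology": ["Ethology"],
    "Ethology": ["Behavior"],
    "Behavior": ["Ethology", "Human behavior"],
    "Human behavior": ["Ball games"],
}

result = subcategories_within_depth(toy, "Natural sciences", 3)
print(result)
```

With a limit of 3, the search stops at "Behavior" and never reaches "Human behavior" or "Ball games". The hard part, of course, is picking the limit.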
So I guess the question is how to deal with this.
One thought is to "hope" that editors of Wikipedia pages will list the categories of pages roughly in order of importance, so that you can assume that the first category listed for a page is "the" category for that page. This would turn the structure into a tree. For the above example, this would cut the list at "Subfields of zoology" because the first listed category for the Ethology category is "Behavioral sciences", not "Subfields of zoology."
Doing this seems to make life somewhat better; you cut out the Stags coaches, but you still get "Chicago Stags draft picks" (at depth 17). The path, if you care, is (Natural sciences -> Physical sciences -> Physics -> Fundamental physics concepts -> Matter -> Structure -> Difference -> Competition -> Competitions -> Sports competitions -> Sports leagues -> Sports leagues by country -> Sports leagues in the United States -> Basketball leagues in the United States -> National Basketball Association -> National Basketball Association draft picks). Still doesn't feel like Natural sciences to me. In fairness, at depth 6, life is much better. You still get "Heating, ventilating, and air conditioning" but many of the weird entries have gone away.
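A rough sketch of the "first listed category" trick, assuming you have, for each category, the list of its parent categories in the order they appear on the page. The helper names are hypothetical, and apart from the Ethology entry (which follows the example above) the toy data is invented.

```python
from collections import defaultdict

def first_parent_tree(parents):
    """Keep only the first listed parent of each category.
    Returns a dict mapping parent -> list of children."""
    children = defaultdict(list)
    for cat, listed in parents.items():
        if listed:
            children[listed[0]].append(cat)
    return children

def descendants(children, root):
    """All categories below root in the pruned structure."""
    seen = {root}
    stack = [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:  # guard in case first parents still form a loop
                seen.add(child)
                stack.append(child)
    seen.discard(root)
    return seen

# Parent lists in page order (toy data).
parents = {
    "Ethology": ["Behavioral sciences", "Subfields of zoology"],
    "Subfields of zoology": ["Natural sciences"],
    "Physical sciences": ["Natural sciences"],
    "Physics": ["Physical sciences"],
}

tree = first_parent_tree(parents)
below = descendants(tree, "Natural sciences")
print(sorted(below))
```

Because Ethology's first listed parent is "Behavioral sciences", it drops out of the Natural sciences subtree here, which is exactly the cut described above.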
Another idea is the following. Despite not being a tree or DAG, there is a root to the Wikipedia hierarchy (called Category:Contents). For each page/category you can compute its minimum depth from that Contents page. Now, when you consider subpages of Natural sciences, you can limit yourself to pages whose shortest path goes through Natural sciences. Basically trying to encode the idea that if the shallowest way to reach Biology is through Natural sciences, it's probably a natural science.
This also fails. For instance, the depth of "Natural sciences" (=5) is the same as the depth of "Natural sciences good articles", so if you start from Natural sciences, you'll actually exclude all the good articles! Moreover, even if you insist that a shortest path go through Natural sciences, you'll notice that many editors have depth 5, so any page they've edited will be allowed. Maybe this is a fluke, but "Biology lists" has a depth of only 4, which means that anything that can be reached through "Biology lists" would be excluded, something we certainly wouldn't want to do. There's also the issue that the hierarchy might be much bushier for some high-level topics than others, which makes comparing depths very difficult.
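Here is roughly what the minimum-depth filter looks like, as a sketch under the assumption that "shortest path goes through the topic" means depth(node) = depth(topic) + distance(topic, node). The function names are mine, and the toy graph and its depths are invented (they don't match the real numbers quoted above), but it reproduces the "Biology lists" problem in miniature.

```python
from collections import deque

def bfs_depths(children, start):
    """Shortest distance (in edges) from start to every reachable node."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth

def reached_shallowest_via(children, root, topic):
    """Nodes under topic for which some shortest path from root passes through topic."""
    from_root = bfs_depths(children, root)
    from_topic = bfs_depths(children, topic)
    return {
        node
        for node, dist in from_topic.items()
        if node != topic and from_root.get(node) == from_root[topic] + dist
    }

# Toy graph: "Biology lists" is also reachable by a shorter route via "Lists".
toy = {
    "Contents": ["Science", "Lists"],
    "Science": ["Natural sciences"],
    "Natural sciences": ["Biology"],
    "Biology": ["Biology lists", "Zoology"],
    "Lists": ["Biology lists"],
    "Biology lists": ["Some biology list page"],
}

kept = reached_shallowest_via(toy, "Contents", "Natural sciences")
print(sorted(kept))
```

"Biology" and "Zoology" survive, but "Biology lists" and everything below it gets thrown out, because the shallow route through "Lists" wins.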
So, that leaves me not really knowing what to do. Yes, I could compute unigram distributions over the pages in topics and cut when those distributions get too dissimilar, but (a) that's annoying and very computationally expensive, (b) it requires you to look at the text of the pages, which seems silly, and (c) you now just have more hyperparameters to tune. You could annotate it by hand ("is this a natural science") but that doesn't scale. You could compute the graph Laplacian and look at flow and use "average path length" rather than shortest paths, but this is a pretty big graph that we're talking about.
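For what it's worth, the unigram comparison itself is easy to sketch. The version below uses Jensen-Shannon divergence as the dissimilarity measure, which is my choice for illustration (any other divergence would do), and whitespace tokenization on toy strings rather than real page text. The cutoff you'd compare against is exactly the kind of extra hyperparameter complained about above.

```python
import math
from collections import Counter

def unigram_distribution(texts):
    """Relative word frequencies over a collection of page texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions,
    1 for distributions with no words in common."""
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}

    def kl(a):
        return sum(pa * math.log2(pa / mix[w]) for w, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy "pages" standing in for the text under each category.
parent = unigram_distribution(["cells and proteins", "atoms and molecules"])
science_child = unigram_distribution(["proteins and cells"])
sports_child = unigram_distribution(["coaches and draft picks"])

print(js_divergence(parent, science_child))
print(js_divergence(parent, sports_child))
```

You would then stop descending into any subcategory whose divergence from the starting category exceeds some threshold.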
Has anyone else tried and succeeded at using the Wikipedia category structure?