JasonTimmins Posted July 5, 2009 Hi There, Over the past month or so, I've noticed that the RDF files I use to drive my site contain some strange data. Specifically, many (59 in this week's version of content.rdf) occurrences of CatID 1 (<catid>1</catid> in the XML). This is many more than the handful usually found as the top-level categories in the DMOZ hierarchy. Here's an example... <Topic r:id="Top/Business/Textiles_and_Nonwovens/Industrial_Yarns_and_Sewing_Threads/Carpet_Yarns"> /Regional/North_America/United_States/Georgia/Localities/A/Athens/Arts_and_Entertainment/Museums I also see that there are 53 occurrences of CatID 1 records in structure.rdf. This has really made a mess of my DMOZ-based directory, <URL Removed>. Have a look for yourself... nasty. Can someone take a look at the files and verify my findings? I've downloaded the files each week for the last six weeks or so and they've all contained this type of corruption. What's going on? Cheers Jason.
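For anyone wanting to verify the counts above independently, here is a minimal sketch in Python 3 that tallies every <catid> value and reports the ones appearing more than once. It assumes each <catid>N</catid> element sits on a single line, as in the snippet quoted above; the sample topics and the helper names (`count_catids`, `duplicated`) are made up for illustration and are not part of any DMOZ tool.

```python
import re
from collections import Counter

# Matches <catid>123</catid> anywhere in a line.
CATID_RE = re.compile(r"<catid>(\d+)</catid>")

def count_catids(lines):
    """Count how often each catid value occurs in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for match in CATID_RE.finditer(line):
            counts[int(match.group(1))] += 1
    return counts

def duplicated(counts):
    """Return only the catid values that occur more than once."""
    return {catid: n for catid, n in counts.items() if n > 1}

# Hypothetical sample data, not taken from a real dump.
sample = [
    '<Topic r:id="Top/Arts">',
    '  <catid>2</catid>',
    '</Topic>',
    '<Topic r:id="Top/Business/Example_A">',
    '  <catid>1</catid>',
    '</Topic>',
    '<Topic r:id="Top/Regional/Example_B">',
    '  <catid>1</catid>',
    '</Topic>',
]

print(duplicated(count_catids(sample)))
```

To run it against a real dump, pass an open file handle instead of the list, for example `open("structure.rdf.u8", encoding="utf-8", errors="replace")`; that file name is an assumption and may differ for your copy. Reading line by line keeps memory use low on large files.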
jimnoble Posted July 5, 2009 What's going on is a known bug which is being investigated by AOL's systems engineers. I can't give a time scale for its resolution, I'm afraid.
sharonfranz Posted July 22, 2009 Quoting jimnoble: "What's going on is a known bug which is being investigated by AOL's systems engineers. I can't give a time scale for its resolution I'm afraid." Do you know if this only affects the most recent dump? Does the previous month's dump have this bug too? Thanks.
RZ Admin photofox Posted July 22, 2009 I believe (and don't quote me on this) that the last available good RDF dump is http://rdf.dmoz.org/rdf/archive/2009-04-07/ The files since that dump are either not available in the archive or problems have been reported with them (i.e. the CatID issue). Curlie Admin photofox
sharonfranz Posted July 22, 2009 Quoting photofox: "I believe (and don't quote me on this) that the last available good RDF dump is http://rdf.dmoz.org/rdf/archive/2009-04-07/ The files since that dump are either not available in the archive or problems have been reported with them (i.e. the CatID issue)." Thanks! I'll try that one out. I'm only using the categories (structure.rdf). Do you know any good resources on the best approach to working with the data at the database level? I'm using PHP/MySQL. I've already imported the data into the database. I took a peek at the tables. To display the sub-categories for each topic, I'm going by the fatherid. I'm new at this, so I'm just wondering if anyone has a better approach. Thanks.
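On the fatherid question, a lookup by parent ID is a common way to walk a category tree, and it generally performs well as long as that column is indexed. Below is a minimal sketch of the idea. It uses Python 3 with an in-memory SQLite database purely as a stand-in for MySQL, and the table name `categories`, its columns other than `catid` and `fatherid`, and the sample rows are all assumptions, not the schema any particular importer produces.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (
    catid    INTEGER PRIMARY KEY,
    fatherid INTEGER,
    title    TEXT
);
-- An index on the parent column keeps child lookups fast on a large table.
CREATE INDEX idx_categories_fatherid ON categories (fatherid);
""")

# Made-up sample rows: (catid, fatherid, title).
conn.executemany(
    "INSERT INTO categories (catid, fatherid, title) VALUES (?, ?, ?)",
    [
        (1, None, "Top"),
        (2, 1, "Arts"),
        (3, 1, "Business"),
        (4, 2, "Music"),
    ],
)

def subcategories(conn, parent_id):
    """Return (catid, title) for the direct children of parent_id, sorted by title."""
    cursor = conn.execute(
        "SELECT catid, title FROM categories WHERE fatherid = ? ORDER BY title",
        (parent_id,),
    )
    return cursor.fetchall()

print(subcategories(conn, 1))
```

The same SELECT should carry over to PHP/MySQL with a prepared statement. Note that a primary key on `catid`, as assumed here, would reject the duplicated CatID rows discussed in this thread at import time.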
JasonTimmins Posted July 23, 2009 Hi There, The problem seemed to start around April but the current file has this at line 6... <!-- Generated at 2009-06-01 14:19:20 GMT on core-n01 --> ...which seems to indicate that they are not even bothering to update the RDFs at the moment. As I write, there are still 59 entries for CatID=1 in this week's structure.rdf. <mutter> Bye for now Jason.
jimnoble Posted July 23, 2009 Quoting JasonTimmins: "they are not even bothering to update the RDFs at the moment" Please see what I wrote in Post #2 of this thread. The problem's resolution is being actively pursued.
JasonTimmins Posted July 23, 2009 I don't get what you're trying to say, Jim. I assume that that line in the XML is the creation date of the file. If it is, clearly, it's not been regenerated since the 1st of June. It looks like the data has been broken since April. I'd like to think that an organisation the size of AOL is capable of 'actively pursuing' a 'known bug' in less than 4 months. Jeeez.
jimnoble Posted July 23, 2009 I'm not an AOL employee but I recognise that, as with many other companies in these difficult times, resources are lower than one might wish and that their efforts have to be prioritised. As I said previously, AOL engineers are working on the issue but we can't predict when it will be resolved. If you're dissatisfied with the situation, I suggest that you ask for a refund.
sharonfranz Posted July 23, 2009 Quoting jimnoble: "I'm not an AOL employee but I recognise that, as with many other companies in these difficult times, resources are lower than one might wish and that their efforts have to be prioritised. As I said previously, AOL engineers are working on the issue but we can't predict when it will be resolved. If you're dissatisfied with the situation, I suggest that you ask for a refund." Hehe... I'm surprised AOL is even using resources on DMOZ since it obviously doesn't generate revenue for them. I'm wondering if DMOZ should go the route of Wikipedia and ask for donations. This seems to work really well for non-profits. Figure out how many people it'll take to run this place, set a goal (an amount), and ask for donations. Take a look at refdesk.com. It's way smaller than DMOZ, but he's able to get enough contributions to keep his site running. Just a suggestion.
sharonfranz Posted July 23, 2009 Quoting photofox: "I believe (and don't quote me on this) that the last available good RDF dump is http://rdf.dmoz.org/rdf/archive/2009-04-07/ The files since that dump are either not available in the archive or problems have been reported with them (i.e. the CatID issue)." Just imported the data. I'm using Josser to import the data into my database. I'm still getting multiple entries for CatID, which is supposed to be unique, right? Is it just the Josser utility or another corrupt data dump? Anyone else used Josser? Thx.
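One way to see exactly which CatID values are duplicated after an import is a GROUP BY query with a HAVING filter. The sketch below is Python 3 with an in-memory SQLite database standing in for MySQL. The table name `categories`, the `topic` column, and the sample rows are assumptions for illustration; the real table and column names depend on the importer. If the same values also show up more than once in the raw RDF file, the dump rather than the import utility is the likely culprit.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
# No uniqueness constraint on catid, so duplicated rows can be loaded and inspected.
conn.execute("CREATE TABLE categories (catid INTEGER, topic TEXT)")

# Made-up sample rows: three topics wrongly share catid 1.
conn.executemany(
    "INSERT INTO categories (catid, topic) VALUES (?, ?)",
    [
        (1, "Top"),
        (1, "Top/Business/Example_A"),
        (1, "Top/Regional/Example_B"),
        (2, "Top/Arts"),
    ],
)

def duplicate_catids(conn):
    """Return (catid, row_count) for every catid that appears more than once."""
    cursor = conn.execute(
        "SELECT catid, COUNT(*) AS n FROM categories "
        "GROUP BY catid HAVING COUNT(*) > 1 "
        "ORDER BY n DESC, catid"
    )
    return cursor.fetchall()

print(duplicate_catids(conn))
```

The SELECT itself is plain SQL and should run unchanged on MySQL.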
sharonfranz Posted February 18, 2010 Hi Jim, Has this problem been resolved? Just wondering. Thx.
RZ Admin Elper Posted February 18, 2010 The main rdf issues (including the Duplicate CatID) are afaik fixed... There may still be a glitch with one colon (:) too many, but the rdf (of 19 January 2010) should be usable. elper All opinions expressed are my own, and do not necessarily represent the official point of view of the administration of either this forum or the directory.
scottie Posted March 16, 2011 I have Python code (runs in 2.6.5) that crawls through the structure files and identifies a class of mistaken entries. Essentially, there should only be a few entries ('' and 'Top' in the default structure) where the <Topic>'s r:id field does not end with "/" followed by the <d:Title> entry. I've gone back into the .rdf.u8 and looked at the 78 entries it flagged, and they all seem to be mistakes. If anyone is interested, I'd be happy to give them the code to run the check on the generated objects. It takes minutes to run, so it is not _that_ horrible.
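The check described above can be sketched roughly as follows. This is not scottie's code, which was never posted in this thread; it is a small Python 3 illustration of the same rule, flagging any topic whose r:id does not end with "/" plus its title. It assumes the <Topic r:id="..."> and <d:Title> element layout mentioned in the thread, treats spaces and underscores in titles as equivalent (an assumption), and uses made-up sample topics.

```python
import re

# Each <Topic r:id="..."> ... </Topic> block, and the <d:Title> inside it.
TOPIC_RE = re.compile(r'<Topic r:id="([^"]*)">(.*?)</Topic>', re.S)
TITLE_RE = re.compile(r"<d:Title>(.*?)</d:Title>", re.S)

def suspect_topics(rdf_text, exempt=("", "Top")):
    """Return (r:id, title) pairs where the r:id does not end with '/' + title."""
    suspects = []
    for topic_id, body in TOPIC_RE.findall(rdf_text):
        match = TITLE_RE.search(body)
        if match is None or topic_id in exempt:
            continue  # no title to compare, or a known root entry
        title = match.group(1).strip()
        # Assumption: normalise spaces to underscores before comparing.
        if not topic_id.endswith("/" + title.replace(" ", "_")):
            suspects.append((topic_id, title))
    return suspects

# Hypothetical sample, not taken from a real dump.
sample = """
<Topic r:id="Top">
  <d:Title>Top</d:Title>
</Topic>
<Topic r:id="Top/Arts/Music">
  <d:Title>Music</d:Title>
</Topic>
<Topic r:id="Top/Business/Example_Yarns">
  <d:Title>Museums</d:Title>
</Topic>
"""

print(suspect_topics(sample))
```

This version reads the whole text into memory, which is fine for a sample but may be heavy for a full structure file; processing the file in per-topic chunks would be the obvious refinement.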
jimnoble Posted April 26, 2011 @naikvarda: I've moved your post to the Using ODP Data forum.