Resource-Zone: need a program to parse odp PLEASE - Resource-Zone

need a program to parse odp PLEASE

#1 martin30

  • Group: Guests

Posted 09 December 2003 - 11:33 PM

I can't find a program that actually works for parsing ODP dumps. I don't want to search dmoz in real time; I want to build my own SQL database.

I couldn't get anything useful from http://dmoz.org/Comp...ta/Upload_Tools

Does anybody know how to export the results to a SQL database?


#2 samiam

  • Member
  • Group: Inactive
  • Posts: 67
  • Joined: 02-April 02

Posted 10 December 2003 - 01:23 AM

I haven't used it, but is http://www.ohardt.co...puter/dev/java/ "ODP Data Parser
A simple XML parser that inserts the ODP strucure into MySQL DB
odp.reader-0.1.zip"

along the lines of what you're looking for?

Also take a look at http://rainwaterrept...dp/rdflist.html

#3 martin30

  • Group: Guests

Posted 10 December 2003 - 08:09 AM

Thank you! But those programs parse the ODP dump while running on a server, and since I am not on one, I would have to upload the whole dmoz file when I only want to parse a relatively small part of the directory.



#4 hmf

  • Member
  • Group: Inactive
  • Posts: 5
  • Joined: 05-August 03

Posted 15 December 2003 - 01:08 AM

Then your best bet is tulipchain (http://ostermiller.org/tulipchain/).
It reads ODP content through the web interface. This makes
sense if you are only interested in your categories' data.
(The tulipchain approach is not usable for parsing larger parts of the directory.)

However, it does not insert its data into a database. I am currently working on such a feature, but I am using Berkeley DB/Object database, because mySQL is far too slow on hierarchical data.

#5 theseeker

  • Member
  • Group: Members
  • Posts: 306
  • Joined: 26-March 02
  • Editor Name: theseeker

Posted 15 December 2003 - 01:49 AM

TulipChain is editor-only, I believe. :monacle:

#6 windharp

  • DMOZ Meta
  • Group: Meta
  • Posts: 4,405
  • Joined: 30-April 02
  • Editor Name: windharp

Posted 15 December 2003 - 04:48 AM

Nope, TulipChain is free for all. Several of its advanced features are available to editors only, but the program itself can be used by anyone.

But as stated above it does not solve the problem. It can produce HTML output of a complete tree with some tricks, but _if_ you want to spider a small part of the ODP I suggest spidering it directly, not with a tool like this that produces output that has to be spidered again.

#7 bobrat

  • Member
  • Group: Inactive
  • Posts: 5,531
  • Joined: 15-April 03

Posted 15 December 2003 - 09:59 AM

Parsing and extracting a subset of data from the RDF dumps is not in itself time-consuming. My experience is that downloading the full RDF dump and extracting a subset is fairly easy and fast to do. Adding the data to mySQL, which is what I have been doing, seems to be the time killer. [Which is why I was only extracting a subset for one area of ODP.]
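
For readers who want to try the subset extraction described above, here is a minimal Python sketch using the standard library's streaming XML parser. It is not the program used in this post. The element names (ExternalPage, d:Title, d:Description, topic), the "about" attribute, and the namespace URIs are assumptions about the dump layout, the sample data is made up, and extract_subset is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for the content dump. Element names, the "about"
# attribute, and the namespace URIs are assumptions about the RDF layout.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/a">
    <d:Title>Site A</d:Title>
    <d:Description>First site.</d:Description>
    <topic>Top/Arts/Online_Writing</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/b">
    <d:Title>Site B</d:Title>
    <d:Description>Second site.</d:Description>
    <topic>Top/Science/Physics</topic>
  </ExternalPage>
</RDF>
"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_subset(stream, prefix):
    """Yield (url, title, description, topic) for pages under one category."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "ExternalPage":
            continue
        fields = {local(child.tag): (child.text or "").strip() for child in elem}
        topic = fields.get("topic", "")
        if topic == prefix or topic.startswith(prefix + "/"):
            yield (elem.get("about"), fields.get("Title", ""),
                   fields.get("Description", ""), topic)
        # Drop the children of each finished record so memory use stays low.
        elem.clear()


rows = list(extract_subset(io.BytesIO(SAMPLE), "Top/Arts"))
for row in rows:
    print(row)
```

For a real dump, the same function could be given an open file object instead of the in-memory sample. The parse itself is a single pass, which fits the observation that extraction is the fast part.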



#8 theseeker

  • Member
  • Group: Members
  • Posts: 306
  • Joined: 26-March 02
  • Editor Name: theseeker

Posted 15 December 2003 - 12:23 PM

Quote

Adding the data to mySQL, which is what I have been doing, seems to be the time killer.


I extract most of the information and put it into mySQL tables, though my programs are very editor-oriented so probably wouldn't be of much use to anyone else. But one way I speed this up is to extract the information into a tab-delimited flat file--for example, if I have a table with URL, Title, Description, and Category, then each site would be on one line, with the URL, title, description and category separated by tabs.

Once I'm finished writing all the data to flat files, I then Truncate the table (or delete everything from the table), and then I use LOAD DATA LOCAL INFILE "flatfile.txt" INTO TABLE tablename

I don't use an index on the table containing the 4 mil+ sites, but if you do, you don't want the index there until you've loaded all the sites into the table. Then create the index.
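
A rough Python sketch of the flat-file approach described above, for anyone who wants a starting point. It only writes the tab-delimited file and builds the SQL text; actually running the statements needs a MySQL client, which is not shown. The table name "sites", the column name "category", and the helper names are hypothetical. The cleaning step assumes the default LOAD DATA settings, where tab separates fields, newline ends a row, and backslash is the escape character.

```python
import os
import tempfile


def clean(value):
    """Make one field safe for a tab-delimited line (default LOAD DATA format)."""
    return (value.replace("\\", "\\\\")   # backslash is the default escape char
                 .replace("\t", " ")      # tab would split the field in two
                 .replace("\r", " ")
                 .replace("\n", " "))     # newline would end the row early


def write_flat_file(rows, path):
    """Write one site per line: URL, title, description, category."""
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for row in rows:
            out.write("\t".join(clean(str(field)) for field in row) + "\n")


def load_statements(path, table, index_column):
    """Empty the table, bulk load it, and only then build the index."""
    return [
        f"TRUNCATE TABLE {table}",
        f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE {table}",
        f"CREATE INDEX idx_{index_column} ON {table} ({index_column})",
    ]


# Made-up example row; note the tab inside the description.
rows = [("http://example.com/a", "Site A", "First\tsite.",
         "Top/Arts/Online_Writing")]
path = os.path.join(tempfile.mkdtemp(), "flatfile.txt")
write_flat_file(rows, path)
for sql in load_statements(path, "sites", "category"):
    print(sql)
```

The index statement comes last on purpose, following the advice above to create the index only after all the sites are loaded.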

One other very big time (and space) saver is to create a catid table. This table should have only two fields: catid and path, where catid is the cat id in the RDF for each category, and path is the full path of the category, like Arts/Online_Writing. Then, in any other table where you need to specify a category, use the catid. When you query a table with a catid, join it with the catid table to get the path.
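
The catid idea can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for mySQL so the example runs anywhere; the table names (categories, sites), the catid values, and the sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a mySQL connection
conn.executescript("""
    -- The catid table: only two fields, catid and path.
    CREATE TABLE categories (catid INTEGER PRIMARY KEY, path TEXT NOT NULL);
    -- Other tables store the short catid instead of repeating the path.
    CREATE TABLE sites (url TEXT, title TEXT, description TEXT, catid INTEGER);
""")

conn.executemany(
    "INSERT INTO categories (catid, path) VALUES (?, ?)",
    [(1, "Arts/Online_Writing"), (2, "Science/Physics")],
)
conn.executemany(
    "INSERT INTO sites (url, title, description, catid) VALUES (?, ?, ?, ?)",
    [
        ("http://example.com/a", "Site A", "First site.", 1),
        ("http://example.org/b", "Site B", "Second site.", 2),
    ],
)

# Join with the catid table to get the full path back when querying.
rows = conn.execute(
    """
    SELECT s.url, s.title, c.path
    FROM sites AS s
    JOIN categories AS c ON c.catid = s.catid
    WHERE c.path = ?
    ORDER BY s.url
    """,
    ("Arts/Online_Writing",),
).fetchall()
print(rows)
```

The space saving comes from storing a small integer per site rather than a long path string, at the cost of one join per query.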

Using those techniques, and a few others that are more complicated to explain, I've reduced the parsing time to under an hour. :monacle:

#9 martin30

  • Group: Guests

Posted 15 December 2003 - 03:29 PM

Thank you hmf

That is exactly what I was looking for! Great!
:D

#10 nakulgoyal

  • Member
  • Group: Inactive
  • Posts: 13
  • Joined: 08-November 03
  • Editor Name: nakulgoyal

Posted 10 April 2004 - 10:49 PM

Thanks hmf !! The info you provided was useful for me as well. Just for General Knowledge you see :-)

#11 giz

  • Member
  • Group: Inactive
  • Posts: 1,556
  • Joined: 26-May 02

Posted 11 April 2004 - 04:00 AM

>> But those programs parse the odp dump while you are on a server, and since I am not, I would have to upload the whole dmoz file when I am only wanting to parse a relatively small part of the directory. <<

You could always run a local copy of Apache, PHP, and mySQL, so that it appears that you are on a server. Access it through http://localhost/ or http://127.0.0.1/ etc.
ODP Editor g1smd

#12 nakulgoyal

  • Member
  • Group: Inactive
  • Posts: 13
  • Joined: 08-November 03
  • Editor Name: nakulgoyal

Posted 11 April 2004 - 11:57 AM

Quote

>> But those programs parse the odp dump while you are on a server, and since I am not, I would have to upload the whole dmoz file when I am only wanting to parse a relatively small part of the directory. <<

You could always run a local copy of Apache, PHP, and mySQL, so that it appears that you are on a server. Access it through http://localhost/ or http://127.0.0.1/ etc.


Good idea g1smd !! I will just try it !!

#13 marengo

  • Member
  • Group: Inactive
  • Posts: 8
  • Joined: 18-June 04

Posted 18 June 2004 - 01:56 AM

I am building an online directory for webmasters now (http://bestcatalog.net/) and I used Extreme Dmoz Extractor (by Nicecoder), a very good application: http://www.nicecoder...z_extractor.php


Copyright © 2002 - 2010 Contributing ODP editors - All Rights Reserved.
This site is in no way affiliated with Netscape Communications Corporation. This site is here as a service to directory users and is maintained by a group of editors.