Resource-Zone: All sites that use live ODP data are down now... - Resource-Zone


All sites that use live ODP data are down now...

#1 browser007

  • Group: Guests

Posted 28 September 2003 - 12:44 PM

... while you can easily access dmoz.org itself.

What's happening? Are we not authorized to use odp data anymore?

You can do the test by yourself, go to:
http://dmoz.org/Comp...full-index.html
and pick a site of your choice: unless they cache some pages on their own server, there's no information from dmoz.org



#2 giz

  • Member
  • Group: Inactive
  • Posts: 1,556
  • Joined: 26-May 02

Posted 28 September 2003 - 01:01 PM

Maybe they should be using http://ch.dmoz.org/ instead?
ODP Editor g1smd

#3 browser007

  • Group: Guests

Posted 28 September 2003 - 01:18 PM

Are you suggesting that if all these sites use ch.dmoz.org, there are no problems at all?





#4 browser007

  • Group: Guests

Posted 28 September 2003 - 02:53 PM

Apparently, a techie has blocked access to sites that use ODP data since Saturday, September 27 (last time my server cached a file was 27-Sep-2003 3:40 PM EST)

PS. It's not only about my site, but about ALL sites that use ODP live data :-(







#5 windharp

  • DMOZ Meta
  • Group: Meta
  • Posts: 4,405
  • Joined: 30-April 02
  • Editor Name: windharp

Posted 28 September 2003 - 10:32 PM

Did you try to access dmoz.org by hand? Most likely it was not a "techie" who blocked access; server load is what prevents you from reaching it.

ch.dmoz.org is a bit out of date, as is de.dmoz.org. Both are external mirrors hosted in different places (hint: those are country codes in front of dmoz.org ;) ), so they don't put load on our main server.
ODP Meta Editor  windharp 
Wichtige Links: Deutsche ODP FAQ / Deutsche ODP-Richtlinien            Important Links: English ODP FAQ / English ODP Guidelines


#6 theseeker

  • Member
  • Group: Members
  • Posts: 306
  • Joined: 26-March 02
  • Editor Name: theseeker

Posted 28 September 2003 - 11:37 PM

When scripts that use live data first showed up, dmoz staff asked the script writers to point them at some other server, such as the Netscape servers (though I suspect that may not be allowed anymore either). But the programmers wanted the most up-to-date data, and so eventually ignored that request.

Since the dmoz.org system was not built to handle a lot of traffic (that's why the data is distributed through the RDF), the number of robots and screen-scraping programs has been slowing the servers down for quite some time. I'm quite surprised that it's taken this long, but from all the signs, I would say that sites taking data directly from the public servers are going to be blocked now.

I suggest exploring other avenues, like processing the RDF. The mirror servers are probably not the solution. They are provided free of charge, and the people providing them are probably not going to keep providing the types of resources it would take to satisfy all the sites that want live data.


#7 browser007

  • Group: Guests

Posted 29 September 2003 - 01:46 AM

Thank you for this information.
I'm trying to find a way to chop these huge RDF dumps into manageable pieces, instead of using ODP live data.
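
One way to chop a large dump into pieces is to stream it line by line and start a new piece every N records, so the whole file never has to fit in memory. The sketch below is only an illustration under assumptions: it assumes each record begins on its own line with a marker such as `<Topic `, and the function name `split_by_record` and its parameters are made up for this example. The resulting pieces are plain slices of the input, not standalone valid RDF files.

```python
import io
from typing import Iterator, List, TextIO


def split_by_record(
    stream: TextIO,
    records_per_chunk: int,
    start_marker: str = "<Topic ",  # assumed record delimiter, adjust to the real dump
) -> Iterator[List[str]]:
    """Yield lists of lines, each holding at most records_per_chunk records.

    Lines before the first marker (e.g. a header) stay with the first chunk;
    lines after the last record stay with the last chunk.
    """
    chunk: List[str] = []
    count = 0
    for line in stream:
        if line.lstrip().startswith(start_marker):
            # A new record begins: close the current chunk if it is full.
            if count == records_per_chunk:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


if __name__ == "__main__":
    # Tiny made-up sample standing in for a real dump file.
    sample = io.StringIO(
        '<Topic r:id="Top/A">\n</Topic>\n'
        '<Topic r:id="Top/B">\n</Topic>\n'
        '<Topic r:id="Top/C">\n</Topic>\n'
    )
    for number, lines in enumerate(split_by_record(sample, 2)):
        print(number, len(lines))
```

With a real dump, the `io.StringIO` sample would be replaced by an opened file, and each yielded chunk could be written to its own numbered file.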

P.S. I know what the problem is with these spiders:

When Googlebot or another search engine spider visits my site to index pages, the bot leaves traces in MY server log files. But the bot uses my program to request pages from dmoz.org, so in the dmoz.org log files there is no trace of Googlebot. Instead, it appears that my site is an unknown robot abusing the dmoz.org server, and they block access to my site :-(
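
To illustrate why every spider hit turns into an upstream request, and how a local cache changes that, here is a minimal sketch of a fetcher that remembers pages for a fixed time. It is an assumption-laden example, not how any of the affected scripts actually worked: `CachedFetcher`, `ttl_seconds`, and the injected `fetch` callable are hypothetical names, and the real download function is left to the caller.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFetcher:
    """Wrap a fetch function so repeated requests within ttl_seconds reuse the stored copy."""

    def __init__(
        self,
        fetch: Callable[[str], str],  # hypothetical: does the real upstream download
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            # Fresh enough: serve the local copy, no upstream request.
            return hit[1]
        body = self._fetch(url)
        self._store[url] = (now, body)
        return body
```

With something like this in front of the upstream request, a spider crawling the same page several times within the time window would cause only one request to the source server.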





#8 User is offline   windharp 

  • DMOZ Meta
  • Group: Meta
  • Posts: 4,405
  • Joined: 30-April 02
  • Editor Namewindharp

Posted 29 September 2003 - 02:10 AM

At least Googlebot is very obedient regarding robots.txt files. Since it is unnecessary to spider borrowed content anyway, such content should be excluded by default. [My opinion!]
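
As a rough illustration of such an exclusion, the sketch below checks a robots.txt rule with Python's standard `urllib.robotparser`. The `/odp/` path is purely an assumption standing in for wherever a site serves its borrowed directory pages, and `is_allowed` is a helper name invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keeps all compliant spiders out of the
# (assumed) /odp/ section that serves borrowed directory content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /odp/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "Googlebot", "/odp/Computers/"))
    print(is_allowed(ROBOTS_TXT, "Googlebot", "/about.html"))
```

A rule like this only helps with spiders that honor robots.txt; it does nothing about requests from other clients.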

#9 browser007

  • Group: Guests

Posted 29 September 2003 - 05:41 AM

windharp, I agree with you: spidering "borrowed content" is not quite fair. It's too late to exclude bots now, because my site no longer has access to dmoz.org.
I wish I had done that from the beginning...




Copyright © 2002 - 2010 Contributing ODP editors - All Rights Reserved.
This site is in no way affiliated with Netscape Communications Corporation. This site is here as a service to directory users and is maintained by a group of editors.