Welcome to Resource Zone.

Any plans to resume RDF data updates? Please, please?

Elper

Curlie Admin
RZ Admin
Joined
Sep 15, 2004
All offers of help are welcome, but are easier to manage when from those who are already editors.
As informator said, at the moment we aren't Open Source, so rather limited.
By all means post in the new Public user discussion forum (it'll get more attention from active and technical editors than here) :)
 
Joined
Jun 8, 2016
Thank you for raising this topic internally at Curlie!
We are all looking forward eagerly to a positive decision.

As always, we all want to offer all our help to make this happen.
 

Elper

Curlie Admin
RZ Admin
Joined
Sep 15, 2004
Following up on this, are we right in assuming that the previous version of RDF is what is wanted, or are any other formats of interest?
 
Joined
Jun 8, 2016
Very sorry for the delay in responding.
I'm delighted that this topic is getting attention. Thank you!

I'd be perfectly happy with the current RDF format, as I have already developed a parser for all the Curlie (ODP) RDF files.

+1
 

arjenpdevries

New Member
Joined
Oct 10, 2020
Location
Nijmegen, The Netherlands
I second the request for this data to be available; quite a lot of scientific research in computing uses "the ODP dataset", and it would be great if we didn't have to take it from the Internet Archive any more but could instead use an up-to-date version that you maintain! That would be very helpful. The exact format is not important, but leaving it as it has been is not a bad idea.
 

Elper

Curlie Admin
RZ Admin
Joined
Sep 15, 2004
All I can really do is continue to prod the other people involved in the decision making and technical processes... ;)
We are rather resource limited, so any ideas for economically creating and then hosting such large files would be welcome; either start a new thread here, or send PM... TIA
 

arjenpdevries

New Member
Joined
Oct 10, 2020
Location
Nijmegen, The Netherlands
I would be happy to organize in my institute (Radboud University) a mirror of these files for download, if you would need that support.

Also, using academictorrents could be an attractive route - we can even combine those two approaches, such that there's some level of guarantee that the dataset would be available.
 

tdfunk

New Member
Joined
Dec 4, 2020
Location
San Diego, CA
I'm throwing my hat in the ring to offer assistance.

If human resource constraints are an issue, I'm willing to assist as needed.

I've been periodically returning here (even to DMOZ, before this) for quite some time, looking for downloads of the RDF data (or other formats, say, JSON-LD).

I'm a long-time polyglot developer. What can I do to help?

This directory is too valuable to keep "locked up" in HTML. It's a treasure trove of curated data. The RDF data (and linked sites) could serve as labeled training data for any number of ML projects.
 

informator

kEditall/kCatmv
Curlie Meta
Joined
Aug 19, 2003
Location
Sweden
Thanks for your offer of assistance!

Without being directly involved, I think the technical people handling this issue have the resources needed. Also, we prefer to use our internal editors when developing (although anyone is welcome to apply to be one).

I think there is internal support for the data to be made available again; it's just a matter of priorities now...
 

Elper

Curlie Admin
RZ Admin
Joined
Sep 15, 2004
If you'd like to see what Curlie is like from the inside, by all means sign up as an editor for a category that interests you. You'll then have access to our internal fora, and maybe add a few interesting sites to the directory at the same time...
(if you had a Dmoz or Curlie Editor account which has expired, they can be reactivated)
 

jmcc

New Member
Joined
Jan 3, 2021
Location
Ireland
The old RDF might be a bit risky to use, as some of the sites may have been deleted and/or re-registered since the last known good version. Link rot was a problem with the old RDF, and Dmoz didn't scale well. The only way of dealing with it effectively is by tracking domain names. That's doable for the gTLDs, but many ccTLDs do not publish their zones.
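
As a rough illustration of why working at the domain level is cheaper than checking every listing, here is a minimal sketch that groups listing URLs by hostname so each host only needs to be looked at once. It is not Curlie's or Dmoz's actual tooling; the function name `group_by_host` and the sample URLs are made up for the example, and a hostname is only an approximation of a registered domain (a proper version would need the Public Suffix List).

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Map each lowercased hostname to the listing URLs that use it."""
    groups = defaultdict(list)
    for url in urls:
        # hostname is None when the string has no network location
        host = urlsplit(url).hostname
        if host is None:
            continue
        # Assumed simplification: only strip a leading "www."
        if host.startswith("www."):
            host = host[4:]
        groups[host].append(url)
    return dict(groups)


# Hypothetical sample listings, not taken from the directory
listings = [
    "http://www.example.com/a",
    "https://example.com/b",
    "http://example.org/",
    "not a url",
]
hosts = group_by_host(listings)
```

Each key in `hosts` could then be compared against zone-file or registration data once, instead of fetching every URL listed under it.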

Regards...jmcc
 

informator

kEditall/kCatmv
Curlie Meta
Joined
Aug 19, 2003
Location
Sweden
Yes, any new data dump would be made of fresh data from the current directory.

We have automatic tools that audit listings but we also believe that a human touch is of big importance in spotting changed/dead urls. :)
 

jmcc

New Member
Joined
Jan 3, 2021
Location
Ireland
We have automatic tools that audit listings but we also believe that a human touch is of big importance in spotting changed/dead urls. :)
I took a quick look at the last known good version of the RDFs and the URLs. The gTLD domain names were still the main group. From what I remember of Dmoz, it used to use a crawler to check domain names. That's a very time-consuming way of doing it. There are some indicators for changed and dead URLs that can be checked more efficiently.

Regards...jmcc
 

Elper

Curlie Admin
RZ Admin
Joined
Sep 15, 2004
The last version of the RDF dates back several years now (and is used on dmoztools dot net), or isn't that what you meant?
There are some indicators for changed and dead urls that can be checked more efficiently.
We are all ears for increased efficiency :cool:
 

jmcc

New Member
Joined
Jan 3, 2021
Location
Ireland
Yes. That's the RDF I was using. Basically, I have a database that tracks domain name transactions across the legacy and new gTLDs and some ccTLDs, and I periodically build IP address maps of all the sites for active domain names.

Regards...jmcc
 

sushidub

New Member
Joined
Feb 14, 2021
Location
Lakewood
Any chance you guys could leverage the Registry of Open Data service on AWS?
Their Open Data Sponsorship Program is specifically for public data providers and orgs such as Curlie.
 