Trying to crawl your site gives 500 internal server error?

Hello sparkfun,

For a school assignment I have to make a nice simple list of different available sensors. They also want to know which (online) shops sell what sensors.

Maintaining that manually would be a pain in the behind, so I came up with the following: make a very simple crawler, specific to a few sites (but expandable to other sites as well).

So it starts at the basics,

<?php
$doc = new DOMDocument();
$doc->loadHTMLFile("http://www.sparkfun.com/");
?>

But that gives me:

Warning: DOMDocument::loadHTMLFile(http://www.sparkfun.com/) [function.DOMDocument-loadHTMLFile]: failed to open stream: HTTP request failed! HTTP/1.1 500 Internal Server Error in D:\webserver\htdocs\school\core\crawler.class.php on line 19

Warning: DOMDocument::loadHTMLFile() [function.DOMDocument-loadHTMLFile]: I/O warning : failed to load external entity "http://www.sparkfun.com/" in D:\webserver\htdocs\school\core\crawler.class.php on line 19

So, before continuing, I would like to know: am I allowed to crawl your site for this purpose (never bad to ask for such a thing, right? :P)? Secondly, why do I get the 500 status >_< I tried Google, which loads without a problem, although it gives a sh*t load of tag errors :stuck_out_tongue:

And, perhaps a ruder question: would it be possible for you to supply a data file with a dump of your products database? That would make it even easier :slight_smile:

Thanks in advance,

Daan Timmer

P.S. Why can I post this topic as a sticky? Bug?

Instead of pulling down potentially megabytes of data, most of which you won’t need, why not download their PDF catalog?

http://www.sparkfun.com/commerce/downlo … atalog.pdf

Well, hehe, megabytes O.- I don’t think we will reach that very fast.

The HTML is nice and clean, so.

When crawling I don’t pull any images, only the HTML.

I only want a certain portion of the products (sensors category).

And I wouldn’t mind using the PDF catalog, but that is 4.5MB in size, so that way we would pull way more data off the servers.

It is also not like we fetch data on each request. It is more like an update once a week or so that takes less than 1 minute to complete.

Another thing: if you can provide me with a PHP PDF reader that can extract the data the way I need it, then I would use the PDF immediately.

I’ve found a workaround that works.

Will be waiting for a reply from an official SparkFun member to tell me whether I am allowed to use this data.

Email them.

Sure, no problem, if there were an email address to mail to…

I’ve searched the site but found no email address >< Only ones for technical (product) issues ><

Or I am blind! That is possible too of course! :smiley:

http://www.sparkfun.com/commerce/popup_feedback.php

Hope this helps!

Daan Timmer,

Try this:

<?php
$doc = new DOMDocument();
$doc->loadHTMLFile("http://www.sparkfun.com/commerce/");
?>

There’s a lot of stuff on the SparkFun domain that you’re not going to have access to. The entire storefront lives under that directory.

Hope that helps!
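
On the tag errors mentioned earlier in the thread: DOMDocument tends to emit a warning for every piece of malformed markup it runs into. Below is a minimal sketch of one way to keep those quiet and pull the links out of a page. It parses a string rather than fetching anything, and the function name extractLinks, the sample markup, and the /example/... paths are all made up for illustration; they are not the real storefront structure.

```php
<?php
// Hypothetical helper: returns an array of link text => href for every
// <a href="..."> found in the given HTML string.
function extractLinks(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Throw away the collected warnings and restore the earlier setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[trim($anchor->textContent)] = $anchor->getAttribute('href');
    }
    return $links;
}

// Made-up, deliberately sloppy markup standing in for a fetched page
// (unclosed <li> elements, no closing </ul>, </body> or </html>).
$sample = '<html><body><ul>'
        . '<li><a href="/example/item?id=1">Sensor A</a>'
        . '<li><a href="/example/item?id=2">Sensor B</a>';

print_r(extractLinks($sample));
?>
```

For a real page you would pass in the HTML you fetched instead of $sample. Keying the result by link text is only for readability here; if two links share the same text, the later one overwrites the earlier one, so a plain list of pairs may suit a crawler better.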