How to collect information for Business Intelligence project ? :(

  • 3 Replies


  • Roomba
  • *
  • 1
Hello ,
            I'm a college student and as part of my Business Intelligence course, I have to apply learning algorithms to a "large" dataset.

            Since I was doing it anyway, I thought, "why not use it to study something I am curious about?" So I want to learn about CAPTCHA statistics (how often users fail them).

            Anyway, I don't know how I am going to get hold of relevant stats. A related study by some kids from Caltech used eBay's statistics (they said so in their paper), but I don't see how one gets access to that. If I write a survey and email it to all the people I know, yeah... not very dependable data. Should I write a crawler that crawls the web? Is that even legal? :/

Thanks for your advice guys, I do appreciate it :)

PS: Yes, I know there have been many such studies before, but it'll get me a better grade and help me learn something. I'm not aiming for originality here...



  • At the end of the game, the King and Pawn go into the same box.
  • Trusty Member
  • **********************
  • Colossus
  • *
  • 5864
Re: How to collect information for Business Intelligence project ? :(
« Reply #1 on: December 19, 2012, 11:38:21 pm »
You should direct the majority of your questions to your college professor, especially with regard to legality issues within your state or even the country.
Crawlers and bots, while common in some areas, are often frowned upon, especially by site owners and hosts. Yes, these are the same sites that apparently have no shame about putting cookies on your computer (tracking or otherwise). ;)

KittenAuth was also tried for a time but gained no real acceptance.

Personally, I detest CAPTCHA and its twisted, distorted characters that often take several attempts to get right! I'd go out on a limb and say that the average person (not too elderly) takes at least 2 tries (if not 3) to get a CAPTCHA correct!! This is just my $.02, so do with it as you will.

In some countries, humans actually get PAID between $0.50 and $1.20 USD for every 1,000 CAPTCHAs they decode! Imagine that!!

Good luck to you in your quest to Captcha some solid data! (had to go there!).

BTW, Welcome and don't be afraid to wander into the field of AI if so inclined.
In the world of AI, it's the thought that counts!



  • Trusty Member
  • ***********
  • Eve
  • *
  • 1279
  • Overclocked // Undervolted
    • Datahopa - Share your thoughts ideas and creations
Re: How to collect information for Business Intelligence project ? :(
« Reply #2 on: December 20, 2012, 11:40:06 am »
Hi Dardie

There are good and bad web crawlers. A good crawler follows web etiquette, such as:

If it finds a "noindex, nofollow" tag on a page, it should not read (index) the page or follow links from it.

It should only crawl about 5 to 10 pages of a site and then go away and come back another day for another 5 or 10.

It should not attempt to sign up to a site and then start posting links, adverts or spamming of any kind. Nor should it go hunting for user details, e-mail addresses or user names.

It should only crawl the web and not the Internet. No, they are not the same thing: the web is where the websites are, so stick to that.

Crawlers can also be identified by a site's anti-spam features and can quickly get their IP added to a blacklist of known spammers, so one must be very careful and follow the etiquette (rules).

So really it boils down to how it's done and not what is done.
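
To make the etiquette above a bit more concrete, here is a rough Python sketch of two of the rules: checking a page for a robots meta tag, and capping how many pages you take from one site per visit. The names `page_permissions` and `CrawlBudget` and the default of 5 pages are made up for illustration, and nothing here actually fetches pages. Treat it as a starting point under those assumptions, not a finished crawler.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def page_permissions(html):
    """Return (may_index, may_follow) for one page's HTML (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = "noindex" not in found and "none" not in found
    may_follow = "nofollow" not in found and "none" not in found
    return may_index, may_follow


class CrawlBudget:
    """Caps pages taken from each site per visit (max_pages=5 is an assumed default)."""

    def __init__(self, max_pages=5):
        self.max_pages = max_pages
        self.fetched = defaultdict(int)

    def allow(self, url):
        """Return True and count the page if this site's budget is not used up."""
        host = urlparse(url).netloc
        if self.fetched[host] >= self.max_pages:
            return False
        self.fetched[host] += 1
        return True
```

Before fetching a URL you would call `budget.allow(url)`, and after downloading a page you would call `page_permissions(html)` to decide whether to keep it and whether to queue its links. A real crawler should also honour each site's robots.txt, which Python's standard `urllib.robotparser` module can read.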

Just some info that I have personally picked up during my 14 years of web experience.

Good luck :)



  • Trusty Member
  • ********
  • Replicant
  • *
  • 564
    • Neural network design blog
Re: How to collect information for Business Intelligence project ? :(
« Reply #3 on: December 20, 2012, 12:29:08 pm »
Have you already contacted a 'big company' that uses CAPTCHAs? I'm sure they'll be interested in the results if they haven't done this before.
Usually, companies tend to be pretty happy to help out in this manner; it's beneficial for them as well.

