Common Crawl


infurl

Common Crawl
« on: March 02, 2020, 03:52:09 AM »
https://commoncrawl.org/

Quote
The Common Crawl corpus contains petabytes of data collected over the last 7 years. It contains raw web page data, extracted metadata and text extractions.

Need a lot of data off the internet to feed your artificial intelligence project? You can find it at Common Crawl, a completely free and open project to retrieve and archive vast numbers of web sites. Save yourself a huge amount of time and hassle by using their pre-packaged data. This is the closest thing to having the entire internet in a box that you're ever going to find.
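
For anyone who wants to look at the pre-packaged data without downloading a whole crawl, here is a rough Python sketch of how a lookup might go. It assumes the public URL index at index.commoncrawl.org, a crawl ID of CC-MAIN-2020-10, and index records that carry `offset` and `length` fields; the function names are made up for illustration, and nothing here contacts the network. Check the Common Crawl site for the current crawl list before relying on any of it.

```python
import json
from urllib.parse import urlencode

# Assumptions (not from the post above): the index server address and a crawl ID
# from around the time of this thread. Swap in whatever the site currently lists.
INDEX_SERVER = "https://index.commoncrawl.org"
CRAWL_ID = "CC-MAIN-2020-10"


def build_index_query(url_pattern, crawl_id=CRAWL_ID):
    """Build the index lookup URL for a page or domain (hypothetical helper)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def parse_index_response(body):
    """Parse an index response: one JSON record per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def byte_range_header(record):
    """Build an HTTP Range header covering one archived record.

    Assumes the index record gives 'offset' and 'length' as numeric strings
    pointing into the archive file named by its 'filename' field.
    """
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    return {"Range": f"bytes={start}-{end}"}
```

The idea is to ask the index where a page lives, then request only that byte range from the named archive file rather than pulling down the full multi-gigabyte file.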



Dat D

Re: Common Crawl
« Reply #1 on: March 03, 2020, 02:22:54 AM »
I sometimes crawl the net myself. Google's Puppeteer is superb for that!

https://developers.google.com/web/tools/puppeteer