Ai Dreams Forum

Member's Experiments & Projects => General Project Discussion => Topic started by: infurl on March 02, 2020, 03:52:09 am

Title: Common Crawl
Post by: infurl on March 02, 2020, 03:52:09 am: https://commoncrawl.org/

Quote
The Common Crawl corpus contains petabytes of data collected over the last 7 years. It contains raw web page data, extracted metadata and text extractions.

Need a lot of data off the internet to feed your artificial intelligence project? You can find it at Common Crawl, a completely free and open project to retrieve and archive vast numbers of web sites. Save yourself a huge amount of time and hassle by using their pre-packaged data. This is the closest thing to having the entire internet in a box that you're ever going to find.
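For anyone wondering what "using their pre-packaged data" might look like in practice, here is a rough sketch, not an official client. It assumes the public URL index at index.commoncrawl.org, which returns one JSON record per line. The crawl ID below is only a placeholder, and the helper names are made up for illustration. The code builds a lookup URL, parses the response lines, and works out the byte range of a single archived page. The actual fetching is left out.

```python
import json
from urllib.parse import urlencode

# Assumption: placeholder crawl ID; pick a current one from index.commoncrawl.org.
CRAWL_ID = "CC-MAIN-2020-10"


def build_index_query(url_pattern, crawl_id=CRAWL_ID):
    """Build a URL-index lookup for pages matching url_pattern (hypothetical helper)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


def parse_index_lines(text):
    """Parse the index response: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def warc_range_header(record):
    """Return the HTTP Range header value covering one record inside its WARC file.

    Index records carry 'offset' and 'length' as strings; the range is inclusive.
    """
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    return f"bytes={start}-{end}"
```

The idea is that each index record names a WARC file plus an offset and length, so you can request just that slice of the archive with a Range header instead of downloading the whole file.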
Title: Re: Common Crawl
Post by: Dee on March 03, 2020, 02:22:54 am:
I sometimes crawl the net myself. Google's Puppeteer is superb for that!

https://developers.google.com/web/tools/puppeteer