Thoughts from the UK: Scraping content from other websites

Saturday, May 7, 2011

Scraping content from other websites

Scraping is the "harvesting" of content from websites (terms, model numbers and other keywords) and putting it on another website alongside other content that the webmaster wants you to visit. Sometimes the content has nothing to do with what you are searching for, and it is often generated by a program rather than a human being.

I collect data from web pages and place it into new pages that contain content that is relevant to me. The big difference from traditional scraping is that I do this by hand and NOT with a computer program. The net effect (hopefully) is a collection of something useful that links what people could be searching for with the content on this site.

Scraping by design - according to Wikipedia:

The emergence of XML and web services has lent itself to the creation of technologies that improve the process of extracting machine-friendly data from web pages.

Screen Scraping: - from catb.org

The act of capturing data from a system or program by snooping the contents of some display that is not actually intended for data transport or inspection by programs. Around 1980 this term referred to tricks like reading the display memory of a smart terminal through its auxiliary port. Nowadays it often refers to parsing the HTML in generated web pages with programs designed to mine out particular patterns of content. In either guise screen-scraping is an ugly, ad-hoc, last-resort technique that is very likely to break on even minor changes to the format of the data being snooped.
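
To make the "likely to break" point concrete, here is a minimal sketch of the kind of HTML-mining program that definition describes, using only Python's standard library. The page markup, the "model" class name and the model numbers are all made up for illustration; they do not come from any real site.

```python
from html.parser import HTMLParser


class ModelNumberScraper(HTMLParser):
    """Collect the text of every <td class="model"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_model_cell = False
        self.models = []

    def handle_starttag(self, tag, attrs):
        # Start collecting only inside cells marked with the assumed class name.
        if tag == "td" and dict(attrs).get("class") == "model":
            self._in_model_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_model_cell = False

    def handle_data(self, data):
        if self._in_model_cell and data.strip():
            self.models.append(data.strip())


def scrape_models(html):
    """Return the model numbers found in an HTML string."""
    parser = ModelNumberScraper()
    parser.feed(html)
    parser.close()
    return parser.models


# Invented sample page: two products in a table.
PAGE = (
    '<table>'
    '<tr><td class="model">XJ-900</td><td>Printer</td></tr>'
    '<tr><td class="model">QZ-12</td><td>Scanner</td></tr>'
    '</table>'
)

# The same page after a minor redesign that only renames the CSS class.
REDESIGNED = PAGE.replace('class="model"', 'class="product-code"')

print(scrape_models(PAGE))        # ['XJ-900', 'QZ-12']
print(scrape_models(REDESIGNED))  # []
```

The second call returns nothing even though the visible page has not changed at all, which is the fragility the definition warns about.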

Matt Cutts: Google Algo Change Targets Dupe Content - webmasterworld.com Jan 28, 2011
Wikibase - an example of such a scraper site
English Feeder - another
Systems page
