Saturday, May 7, 2011

Scraping content from other websites

Scraping is the "harvesting" of content from websites (terms, model numbers, and other keywords) and putting it on another website alongside other content that the webmaster wants you to visit. Sometimes the content has nothing to do with what you are searching for, and it is often generated by a program rather than a human being.


I collect data from web pages and place it into new pages that contain content relevant to me. The big difference from traditional scraping is that I do this by hand and NOT with a computer program. The net effect (hopefully) is a collection of something useful that links what people could be searching for with the content on this site.

Scraping by design - according to Wikipedia:

The emergence of XML and web services has lent itself to the creation of technologies that improve the process of extracting machine-friendly data from web pages.
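To make "machine-friendly data" a little more concrete, here is a minimal sketch in Python of reading structured records out of an XML document with the standard library. The feed contents, the element names (`products`, `product`, `name`, `price`) and the function name `parse_products` are all made up for illustration; a real site's feed would look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the element and attribute names are assumptions,
# not taken from any real website.
SAMPLE_FEED = """<products>
  <product model="X100"><name>Widget</name><price>9.99</price></product>
  <product model="X200"><name>Gadget</name><price>24.50</price></product>
</products>"""


def parse_products(xml_text):
    """Return one dict per <product> element found in the XML text."""
    root = ET.fromstring(xml_text)
    return [
        {
            "model": product.get("model"),
            "name": product.findtext("name"),
            "price": float(product.findtext("price")),
        }
        for product in root.iter("product")
    ]


if __name__ == "__main__":
    for item in parse_products(SAMPLE_FEED):
        print(item["model"], item["name"], item["price"])
```

Because the data arrives already labeled, the program never has to guess where a model number or a price sits on the page.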


Screen scraping - from catb.org:

The act of capturing data from a system or program by snooping the contents of some display that is not actually intended for data transport or inspection by programs. Around 1980 this term referred to tricks like reading the display memory of a smart terminal through its auxiliary port. Nowadays it often refers to parsing the HTML in generated web pages with programs designed to mine out particular patterns of content. In either guise screen-scraping is an ugly, ad-hoc, last-resort technique that is very likely to break on even minor changes to the format of the data being snooped.
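For the curious, "parsing the HTML in generated web pages" can be sketched in a few lines of Python using only the standard library. This is just an illustration under my own assumptions: the class name `LinkCollector`, the helper `extract_links`, and the sample page are invented here, and the pattern being mined (link targets and their text) is only one of many a scraper might look for.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return all (href, text) pairs found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page used only for this example.
SAMPLE_PAGE = '<p><a href="/a">One</a> and <a href="/b">Two <b>bold</b></a></p>'

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_PAGE):
        print(href, text)
```

The fragility described in the quote shows up quickly: if the page started rendering its links with JavaScript or a different tag, this collector would silently return nothing.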
