how to scrub websites?

Homerboy

Lifer
Mar 1, 2000
We often hire 3rd party contractors to build us simple applications that allow us to do some data-mining en masse off of various public websites.

In most scenarios, we feed the program (or the 3rd party directly) a CSV of the data fields we want entered into the forms on the website(s), and it basically automates the task of entering the information into the forms and capturing the results (either as a screenshot or as a return CSV of the data, etc.).

I was curious as to how they go about doing this. Are they just parsing the HTML of the document itself and inserting where applicable or ???

** Please note that these sites are set up for this sort of use and have no problems with such practices.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
It sounds like they're writing apps that use a web service. Basically, if you POST to a site with the right text, you'll get the right text back, whether as HTML or some other file.

If they're using an actual web browser and screen shots, that's some kind of automation system like Selenium.
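
To make the "POST the right text, get the right text back" idea concrete, here is a minimal Python sketch using only the standard library. The URL and the field names (`name`, `zip`) are made up for illustration; a real site will have its own form action and field names, which you can read from the form's HTML.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical form endpoint; substitute the real form's "action" URL.
FORM_URL = "http://www.example.com/search"


def build_form_request(url, fields):
    """Build a POST request that carries the fields the way a browser form would."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def submit_form(url, fields, timeout=30):
    """Send the form and return the response body as text (HTML, CSV, ...)."""
    with urlopen(build_form_request(url, fields), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Inspect the request without sending it; the field names are assumptions.
request = build_form_request(FORM_URL, {"name": "Smith", "zip": "90210"})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `submit_form(FORM_URL, {...})` once per row of the input CSV would be the simplest version of the batch job described in the first post.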
 

Hmongkeysauce

Senior member
Jun 8, 2005
I wrote a VB script for Excel a few weeks back that did something similar. The script takes data from Excel, pastes it into a field on an online form, and submits the form. That's as far as I had to go, but I'd imagine grabbing the data would be a matter of waiting for the page to come back after the POST and grabbing the values from the correct elements. If the form uses GET, then it's probably even easier. A snippet of the script that I wrote is below:

Code:
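    With CreateObject("InternetExplorer.Application")
        .Visible = True
        .Navigate "http://www.yourform.com"
        ' Wait until the page has finished loading (4 = READYSTATE_COMPLETE)
        Do While .Busy Or .ReadyState <> 4
            DoEvents    ' yield so Excel stays responsive while waiting
        Loop
        With .document
            ' Fill in the form field, then click the submit button
            .getElementById("sourceData").Value = sText
            .getElementById("submitbutton").Click
        End With
    End With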

sText is the data from Excel
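
For the "grabbing the values from the correct elements" half, here is a rough Python sketch (standard library only) that pulls the text out of an element by its id once you have the response HTML. The id `result` and the sample markup are invented for the example, and the void-tag list is deliberately short, so treat it as a starting point rather than a complete parser.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementTextById(HTMLParser):
    """Collect the text inside the first element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while the parser is inside the target element
        self.found = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, element_id):
    """Return the text content of the element with that id, or None if absent."""
    parser = ElementTextById(element_id)
    parser.feed(html)
    return "".join(parser.chunks).strip() if parser.found else None


# Made-up response page for illustration.
PAGE = '<html><body><div id="other">x</div><span id="result">Total: <b>42</b><br> items</span></body></html>'
print(text_by_id(PAGE, "result"))
```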
 

DaveSimmons

Elite Member
Aug 12, 2001
For sites without an API or web service, "screen scraping" is when you make HTTP POST or GET requests to a web page, receive the HTML response back (with results like doing "view source" in your browser), then use string functions to get the data out of the response.

This approach is brittle since if the page from the server changes, your code to find the data may break.

The code for each page is also usually going to be different, both to send the request and to process the result.
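
A small Python sketch of the string-function approach, and of why it is brittle. The HTML fragments are made up; the second one simulates a cosmetic change on the server that adds a CSS class.

```python
def between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()


# Made-up response fragments, before and after a cosmetic redesign.
OLD_PAGE = '<td class="price">$19.99</td>'
NEW_PAGE = '<td class="price bold">$19.99</td>'

print(between(OLD_PAGE, '<td class="price">', "</td>"))  # $19.99
print(between(NEW_PAGE, '<td class="price">', "</td>"))  # None
```

The data is still on the new page, but the exact-match marker no longer is, so the extraction silently returns nothing. Checking for `None` and failing loudly at least makes the breakage visible.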
 

beginner99

Diamond Member
Jun 2, 2009
or maybe the site / web service offers the option to get the data in a more usable format like JSON.
 

DaveSimmons

Elite Member
Aug 12, 2001
Good point. I did say "For sites without an API or web service ....", but it's true that sometimes you can use scraping to make your own (brittle, undocumented) web service by figuring out which JSON calls a page makes back to the server to fill in parts of itself.

You still need to impersonate a browser instead of using a stable and supported SOAP, REST etc. service with documented behavior, but you may be able to get a response that's easier to parse than a page full of HTML.
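
As a rough illustration of "impersonating a browser" against such a JSON call, the Python sketch below sends the kind of headers a browser's AJAX request typically carries. The URL and every header value are placeholders; in practice you would copy the real ones from the browser's network tab, and a given site may check more or fewer of them.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint of the kind the page itself calls to load its data.
DATA_URL = "http://www.example.com/ajax/results?page=1"

# Placeholder values; copy the real ones from the browser's network tab.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://www.example.com/results",
}


def build_ajax_request(url, headers=BROWSER_HEADERS):
    """Build a GET request that looks like the page's own AJAX call."""
    return Request(url, headers=dict(headers))


def fetch_json(url, timeout=30):
    """Fetch the endpoint and decode the JSON body into Python objects."""
    with urlopen(build_ajax_request(url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```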
 

Leros

Lifer
Jul 11, 2004
Be aware that a lot of websites won't accept a form submission without a valid CSRF token, which makes POSTing to the site a bit more difficult since you need to actually load the page with the form to get the CSRF token. Still completely doable though.
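
A sketch of the token step in Python, assuming the token lives in a hidden input on the form page (a common arrangement, but not the only one). The field name `csrf_token` and the sample form are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputs(HTMLParser):
    """Collect the name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(form_html):
    """Return all hidden form fields (including any CSRF token) as a dict."""
    parser = HiddenInputs()
    parser.feed(form_html)
    return parser.fields


def build_post_body(form_html, user_fields):
    """Merge the page's hidden fields with our own values into a POST body."""
    fields = hidden_fields(form_html)
    fields.update(user_fields)
    return urlencode(fields)


# Made-up form page; "csrf_token" is an assumed field name.
FORM_PAGE = (
    '<form method="post" action="/search">'
    '<input type="hidden" name="csrf_token" value="abc123">'
    '<input type="text" name="q">'
    "</form>"
)
print(build_post_body(FORM_PAGE, {"q": "widgets"}))
```

The token is usually tied to a session cookie, so the GET that loads the form and the POST that submits it generally need to share cookies, for example through one `urllib.request` opener built with `HTTPCookieProcessor`.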
 

beginner99

Diamond Member
Jun 2, 2009
cURL or wget can be used for that. I know what you mean by undocumented; I recently had to do exactly that. The data displayed on the page was loaded by AJAX as JSON, and it was pretty straightforward to make a cURL GET request that fetched all of it.
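
Once the JSON is fetched (with cURL or anything else), turning it into the kind of return CSV mentioned in the first post takes only a few lines. A Python sketch, where the response shape (a `results` list of flat objects) and the column names are assumptions for illustration:

```python
import csv
import io
import json

# Made-up payload of the kind an AJAX endpoint might return.
SAMPLE_RESPONSE = (
    '{"results": [{"id": 1, "name": "Alpha", "price": 9.5},'
    ' {"id": 2, "name": "Beta", "price": 12.0}]}'
)


def json_rows_to_csv(json_text, key, columns):
    """Convert the list of objects stored under `key` into CSV text."""
    rows = json.loads(json_text)[key]
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        extrasaction="ignore",   # silently drop keys we did not ask for
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


print(json_rows_to_csv(SAMPLE_RESPONSE, "results", ["id", "name"]))
```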

This is definitely something anyone creating web pages should also consider: are users allowed to access the data that way? IMHO this is easily forgotten...
 

DaveSimmons

Elite Member
Aug 12, 2001
Scraping Windows apps is done by the bad guys for harvesting passwords and account security pics so it's not something I'd offer tips on. What's your intended use, or just curious?
 