Web scraping C#

MrScott81

Golden Member
Aug 31, 2001
1,891
0
76
I'm trying to scrape some data from this website, but when I read the page (or right-click and view source in Chrome), I end up getting something different from the final source:

http://tempostorm.com/decks/kitkatzs-control-warrior

Once the page has completely loaded, you can inspect it with the Chrome debugging tools and see the "final" version.

Any idea how to get this "final" version in C#?

This is a snippet from what I'm using:
Code:
var wc = new WebClient();
var websiteContent = await wc.DownloadStringTaskAsync(new Uri(url));
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
You need to run a headless browser like PhantomJS to render the page in memory. You're getting the initial version, before the JavaScript runs client-side and modifies the DOM. I haven't used PhantomJS with C# on the .NET stack, but I know people are doing it.

Edit: I should add to my simplistic answer. Maybe you have to render the page, maybe you don't. It all depends on the page in question. There are all sorts of little tricks to this, and you pretty much have to dive into the page to figure out the best way to get what you want.

The bottom line is that after the page is loaded in a browser, client-side JavaScript is going to run and do stuff to it. Sometimes that means the stuff you want is going to get fetched from a service during or after page load. Sometimes the stuff you want is already in the page source, but it's somewhere else and the script is going to move it around when the page is rendered. It all depends on how they structured the site and when they load content.
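For the case where the data is already sitting in the page source, here is a minimal sketch of pulling it out of the raw HTML without rendering anything. It assumes the page assigns a JSON object to a variable in an inline script; the class name EmbeddedData and the variable name deckData are hypothetical, so check the actual source of the page you are scraping first.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmbeddedData
{
    // Returns the JSON object literal assigned to variableName in an inline
    // script, or null if no such assignment is found.
    // Assumption: the assignment looks like "name = { ... };".
    // Limitation: a "};" inside a string value would cut the match short.
    public static string ExtractJson(string html, string variableName)
    {
        var pattern = @"\b" + Regex.Escape(variableName) + @"\s*=\s*(\{.*?\})\s*;";
        var match = Regex.Match(html, pattern, RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

You would pass in the string you already get back from DownloadStringTaskAsync and hand the result to a JSON parser. If the variable isn't in the source at all, the data is probably fetched later by script, and you're back to rendering the page or finding the service it calls.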
 

MrScott81

WebBrowser doesn't get the proper HTML as far as I can tell. I'll have to keep digging.
 

Markbnj

MrScott81 said: "WebBrowser doesn't get the proper HTML as far as I can tell. I'll have to keep digging."

So the WebBrowser control doesn't execute the JavaScript? Anyway, you're going to have to find a way to do that, or mine the info you want from what is in the initial HTML response. If it isn't there, then rendering is the only choice. Google around for C# and headless browsers; I see a few discussions. There must be some solution you can put together.
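As one possible shape for the headless-browser route, here is a rough sketch using Selenium WebDriver with headless Chrome. This is a swapped-in technique, not PhantomJS itself, and it assumes the Selenium.WebDriver and Selenium.Support NuGet packages plus a chromedriver binary matching your installed Chrome. The class name RenderedPage and the readySelector parameter are made up for illustration; you would pass a CSS selector for an element that only exists once the scripts have finished.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class RenderedPage
{
    // Loads the URL in headless Chrome, waits until an element matching
    // readySelector (hypothetical, depends on the target page) appears,
    // then returns the DOM serialized after the scripts have run.
    public static string GetRenderedHtml(string url, string readySelector)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            // Throws WebDriverTimeoutException if nothing matches in time.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(15));
            wait.Until(d => d.FindElements(By.CssSelector(readySelector)).Count > 0);

            return driver.PageSource;
        }
    }
}
```

Waiting on a specific element rather than a fixed delay is a design choice: the page decides when its content is ready, so the scraper should key off the content itself.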
 

nickbits

Diamond Member
Mar 10, 2008
4,122
1
81
The WebBrowser control should work. Make sure to wait for the DocumentCompleted event before you grab the HTML. I found it works best to use InvokeScript and call eval with document.body.outerHTML as its argument.

e.g.
browser.Document.InvokeScript("eval", new object[] { "document.body.outerHTML" });
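Fleshed out a little, the idea above might look something like the sketch below. It assumes WinForms (System.Windows.Forms) and runs the control on its own STA thread with a message loop so it can be called from a console app; the class and method names are just for illustration. It has no timeout, so it will hang if the page never finishes loading, and scripts that fetch data after DocumentCompleted fires may need an extra wait before the eval call.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

public static class WebBrowserScraper
{
    // Navigates a WebBrowser control to the URL and returns the body HTML
    // as it stands when the top-level document reports completion.
    public static string GetRenderedHtml(string url)
    {
        string html = null;

        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (sender, e) =>
                {
                    // The event also fires for frames; skip those.
                    if (e.Url != browser.Url) return;

                    html = browser.Document.InvokeScript(
                        "eval", new object[] { "document.body.outerHTML" }) as string;

                    Application.ExitThread(); // stop the message loop below
                };

                browser.Navigate(url);
                Application.Run(); // pump messages until ExitThread is called
            }
        });

        // The WebBrowser control requires a single-threaded apartment.
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();

        return html;
    }
}
```

Keep in mind the control is backed by Internet Explorer, so a page that relies on newer JavaScript features may still not render the way it does in Chrome.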


Also maybe useful: I'm a fan of CsQuery for traversing the HTML.
 