Web scraping county, population and median home value from United States postal zip codes.
I switched from using InternetExplorer.Application to New MSXML2.XMLHTTP60. For the first time, I also broke the code into smaller subs and functions, among other things.
- This code ran to completion only once; other test runs returned only 70-100 records.
- Using InternetExplorer.Application, I projected the code to complete in 1 h 45 min by timing 20 records. With the XML method, I projected 25-30 min, as it takes about 5 min to fetch 70-100 records.
- Excel goes completely blank when running (white screen).
There appear to be several things I can do:
- Early binding (which I didn’t understand how to implement based on the thread)
- Creating VBScripts to simulate multi-threading. I've not used VBScript with Excel before, so this option is taking me a while as I read up on VBScript in more depth.
- Storing results in an array. I can't find the link any more, but I read elsewhere that jumping around makes things slow. According to that thread, I should store all values in an array first and write them to the corresponding cells only after everything has been retrieved, instead of retrieving and writing one value at a time. (I think I can handle this, but I'd welcome pointers on whether it actually works.)
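
The array idea from the last bullet could look roughly like the sketch below. It is only an illustration under stated assumptions: `FakeLookup`, `BuildResults` and `WriteResultsOnce` are hypothetical names that do not appear in my code, `FakeLookup` stands in for the real web request so the sketch runs offline, and column C is assumed to hold at least two zip codes (a single-cell range returns a scalar from `.Value`, not an array).

```vba
Option Explicit

' Hypothetical stand-in for the real per-zip lookup (no network access).
Private Function FakeLookup(ByVal zip As String, ByVal fieldIndex As Long) As String
    FakeLookup = zip & "-field" & fieldIndex
End Function

' Fills an in-memory array with three values per zip code.
' zips is a 2-D, 1-based array as returned by Range.Value for a multi-cell range.
Public Function BuildResults(ByVal zips As Variant) As Variant
    Dim results() As Variant
    Dim r As Long, c As Long

    ReDim results(1 To UBound(zips, 1), 1 To 3)
    For r = 1 To UBound(zips, 1)
        For c = 1 To 3
            results(r, c) = FakeLookup(CStr(zips(r, 1)), c)
        Next c
    Next r
    BuildResults = results
End Function

' Reads the zip codes once, builds all results in memory, writes them back once.
Public Sub WriteResultsOnce()
    Dim zipRange As Range
    Dim zips As Variant
    Dim results As Variant

    Set zipRange = Range("C2", Range("C2").End(xlDown))
    zips = zipRange.Value                        ' single read from the sheet
    results = BuildResults(zips)

    ' Single write: the three columns to the right of the zip codes.
    zipRange.Offset(0, 1).Resize(UBound(results, 1), 3).Value = results
End Sub
```

This removes the per-cell writes, but it does not change how many web requests are made, so the gain may be small if most of the time is spent waiting on the network.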
Variables

    'ZipCodeScrape Variables
    Public ZipCodeRange As Range
    Public cell As Variant

    'Web Variables
    Public IE As MSXML2.XMLHTTP60
    Public url As String
    Public post As Object
    Public HTML As MSHTML.HTMLDocument
    Public HTMLbody As MSHTML.HTMLBody

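
On the early-binding item: declarations of the form `As MSXML2.XMLHTTP60` together with `New MSXML2.XMLHTTP60` are already early-bound, which is why they need the "Microsoft XML, v6.0" and "Microsoft HTML Object Library" references ticked under Tools > References. The sketch below contrasts the two styles; `NewRequestEarly` and `NewRequestLate` are hypothetical helper names used only for illustration.

```vba
Option Explicit

' Early binding: the type is known at compile time.
' Requires a project reference to "Microsoft XML, v6.0".
Public Function NewRequestEarly() As MSXML2.XMLHTTP60
    Set NewRequestEarly = New MSXML2.XMLHTTP60
End Function

' Late binding: the object is looked up by ProgID at run time.
' No project reference is needed, but there is no IntelliSense
' and member calls are resolved when they execute.
Public Function NewRequestLate() As Object
    Set NewRequestLate = CreateObject("MSXML2.XMLHTTP.6.0")
End Function
```

Either function returns an object that can be used with the same `Open` and `send` calls.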
Gathering zip codes and using a function to retrieve data

    Sub ZipCodeScrape()
        Set IE = New MSXML2.XMLHTTP60
        url = "https://www.unitedstateszipcodes.org/"
        Set ZipCodeRange = Range("C2", Range("C2").End(xlDown))

        Dim TargetElement(1 To 3) As String
        TargetElement(1) = "County:"
        TargetElement(2) = "Population"
        TargetElement(3) = "Median Home Value"

        Dim i As Integer
        For Each cell In ZipCodeRange
            For i = 1 To 3
                cell.Offset(0, i).Value = dataScrape("th", TargetElement(i), "td")
            Next i
        Next cell
    End Sub

Here is the function I’m using to retrieve the data

    Private Function dataScrape(ByVal TagName As String, Element As String, targetTagName)
        IE.Open "GET", url & cell.Value, False
        IE.send
        While IE.readyState <> 4: DoEvents: Wend
        Set HTML = New MSHTML.HTMLDocument
        Set HTMLbody = HTML.body
        HTMLbody.innerHTML = IE.responseText
        For Each post In HTMLbody.getElementsByTagName(TagName)
            If InStr(post.innerText, Element) > 0 Then
                dataScrape = post.ParentNode.getElementsByTagName(targetTagName)(0).innerText
                Exit For
            End If
        Next post
    End Function

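
One thing the code above makes visible: dataScrape sends a GET request on every call, and ZipCodeScrape calls it three times per zip code, so each page is downloaded three times. A possible alternative is to download each page once and pull all three values from the same parsed document. The sketch below assumes the same page layout (a th label with a td in the same row); `FetchPage` and `ExtractFields` are hypothetical names, not part of the original code.

```vba
Option Explicit

' Downloads a page synchronously; returns "" on any non-200 status.
Public Function FetchPage(ByVal pageUrl As String) As String
    Dim request As MSXML2.XMLHTTP60
    Set request = New MSXML2.XMLHTTP60
    request.Open "GET", pageUrl, False   ' False = synchronous, send blocks until done
    request.send
    If request.Status = 200 Then FetchPage = request.responseText
End Function

' Parses the HTML once and returns, for each label, the text of the first
' <td> in the same row as the first <th> containing that label.
' Entries stay Empty when a label is not found.
Public Function ExtractFields(ByVal pageHtml As String, ByVal labels As Variant) As Variant
    Dim doc As MSHTML.HTMLDocument
    Dim header As Object
    Dim cells As Object
    Dim results() As Variant
    Dim i As Long

    ReDim results(LBound(labels) To UBound(labels))
    Set doc = New MSHTML.HTMLDocument
    doc.body.innerHTML = pageHtml

    For Each header In doc.getElementsByTagName("th")
        For i = LBound(labels) To UBound(labels)
            If IsEmpty(results(i)) Then
                If InStr(header.innerText, labels(i)) > 0 Then
                    Set cells = header.parentNode.getElementsByTagName("td")
                    If cells.Length > 0 Then results(i) = cells(0).innerText
                End If
            End If
        Next i
    Next header

    ExtractFields = results
End Function
```

A call such as `ExtractFields(FetchPage(url & cell.Value), Array("County:", "Population", "Median Home Value"))` would then return all three values from a single request per zip code.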