See the title. I’m currently using Selenium and Python, but the same applies to any other scraping solution.
I’m wondering 1) which of the options outlined below are optimal/recommended/best practices and 2) if there are existing solutions/helper libraries, which keywords I should search for. To stay objective, “optimal/recommended/best practices” means “widely used and/or promoted/endorsed by high-profile projects in the niche.” I couldn’t find any Selenium-related or general-purpose material on this topic after about a day of net search time, which probably means I’m missing some critical piece(s) of information.
The basic operations when scraping are:
- searching for an element (by CSS selector/XPath and/or by hand for things that those aren’t capable of)
- interacting with an element (input text, click)
- reading element data
And the call chain goes like this:
(Test code ->) User code -> Framework (selenium) -> Browser (web driver) -> Site
So, there are 3 hops here that I could mock. Each one poses challenges:
- Mock the site: launch a local HTTP server and direct the browser there
  - Have to reimplement the scraped site’s interface, in web technologies
- Mock the browser (e.g. populate HtmlUnit (an in-process browser engine) with predefined HTML at appropriate moments)
  - Much simpler, but I still need to emulate state transitions/action reactions somehow
- Mock the framework calls
  - The truest to the unit testing philosophy, the least work
  - I’m however worried that it’s too restrictive. E.g. I can find the same element by various means. A mock object can only accept a very specific course of action as it lacks the sophistication to e.g. check if some other selector would produce the same result.
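To make the third option and its restrictiveness concrete, here is a minimal sketch using only the standard library's `unittest.mock`. The `search` function, the selectors and the result texts are hypothetical stand-ins for the user code and the site; `"css selector"` is the string value of Selenium's `By.CSS_SELECTOR`, used directly so the sketch runs without Selenium installed.

```python
from unittest.mock import MagicMock

CSS = "css selector"  # string value of selenium's By.CSS_SELECTOR


def search(driver, query):
    """Hypothetical user code under test: type a query, submit, read results."""
    box = driver.find_element(CSS, "input[name=q]")
    box.send_keys(query)
    driver.find_element(CSS, "button[type=submit]").click()
    return [e.text for e in driver.find_elements(CSS, ".result")]


def make_fake_driver():
    """Build a mocked driver that only knows the exact selectors listed here."""
    driver = MagicMock(name="driver")
    box = MagicMock(name="box")
    button = MagicMock(name="button")
    elements = {"input[name=q]": box, "button[type=submit]": button}
    # Any selector not in the dict raises KeyError, even if it would match
    # the same element on the real site. This is the restrictiveness issue.
    driver.find_element.side_effect = lambda by, selector: elements[selector]
    driver.find_elements.return_value = [
        MagicMock(text="first"),
        MagicMock(text="second"),
    ]
    return driver, box, button


driver, box, button = make_fake_driver()
results = search(driver, "python")
box.send_keys.assert_called_once_with("python")
button.click.assert_called_once()
```

If `search` were refactored to locate the input by XPath instead, this test would fail although the behaviour against the real site would be unchanged.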
There are also two options for what content to provide: either
- provide the site’s original content that it produced for a test query, compiled into some sort of self-contained package
  - labor-intensive and error-prone; or
- provide the bare minimum to satisfy the tested algorithm
  - much simpler, but would fail for other possible algorithms that would succeed with the real site
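As a rough illustration of the "bare minimum" variant combined with mocking the site, the sketch below serves a couple of hand-written pages from a throwaway local HTTP server using only the standard library. The `PAGES` fixture, the paths and the `fake_site` helper are all hypothetical names invented for this example.

```python
import threading
from contextlib import contextmanager
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical fixture: the least HTML the scraper under test needs.
PAGES = {
    "/": b"<form action='/results'><input name='q'><button>Go</button></form>",
    "/results": b"<ul><li class='result'>first</li></ul>",
}


class FixtureHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Ignore the query string; the fixture always returns the same page.
        body = PAGES.get(self.path.split("?")[0])
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


@contextmanager
def fake_site():
    """Start the fixture server on a free port and yield its base URL."""
    server = HTTPServer(("127.0.0.1", 0), FixtureHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield f"http://127.0.0.1:{server.server_port}"
    finally:
        server.shutdown()
        server.server_close()
        thread.join()


with fake_site() as base_url:
    html = urlopen(base_url + "/results?q=python", timeout=5).read()
```

In an actual test the base URL would be passed to the browser (e.g. `driver.get(base_url)`) instead of `urlopen`. Swapping `PAGES` for pages saved from the real site would give the "original content" variant with the same server code.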
One last concern is the fact that a site is effectively a state machine. I’m not sure which will be more useful:
- implement the complete state machine, probably as some kind of specification, and set/check its states in the tests
  - very labor-intensive without some kind of library that reduces the work to writing a formal specification; or
- simply validate the action sequences
  - which doesn’t seem to actually test the code against anything; it merely reiterates what the code does
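For comparison with validating call sequences, here is a small sketch of the state-machine approach: a hand-written fake that exposes only the elements valid in its current state and changes state in response to actions. The two states, the selectors and all class names are hypothetical; the fake mimics just the `find_element`, `send_keys`, `click` and `text` parts of the Selenium interface.

```python
class FakeElement:
    """Stand-in for a WebElement that reports actions back to the fake site."""

    def __init__(self, site, name, text=""):
        self.site, self.name, self.text = site, name, text

    def send_keys(self, value):
        self.site.typed[self.name] = value

    def click(self):
        self.site.on_click(self.name)


class FakeSite:
    """Hypothetical two-state site: 'search' -> 'results'."""

    # Which selectors resolve to which logical element in each state.
    PAGES = {
        "search": {"#q": "query", "#go": "submit"},
        "results": {"#summary": "summary"},
    }

    def __init__(self):
        self.state = "search"
        self.typed = {}

    def find_element(self, by, selector):
        page = self.PAGES[self.state]
        if selector not in page:
            raise LookupError(f"{selector!r} not present in state {self.state!r}")
        name = page[selector]
        text = f"results for {self.typed.get('query', '')}" if name == "summary" else ""
        return FakeElement(self, name, text)

    def on_click(self, name):
        # The only transition: submitting a non-empty query shows results.
        if self.state == "search" and name == "submit" and self.typed.get("query"):
            self.state = "results"


def run_search(driver, query):
    """Hypothetical user code under test."""
    driver.find_element("css selector", "#q").send_keys(query)
    driver.find_element("css selector", "#go").click()
    return driver.find_element("css selector", "#summary").text


site = FakeSite()
summary = run_search(site, "python")
```

The test then asserts on the outcome (`summary`, `site.state`) rather than on the exact calls made, so the code under test can take a different path to the same state without breaking the test.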