However, for one part, the play-by-play commentaries,
nothing was working, and I ended up relying on recording and repeating mouse-and-keyboard
macros. It's crude, but the loading-as-scrolling mechanic was just too hard to
deal with programmatically, even with the otherwise very powerful RSelenium.
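For context, the standard RSelenium approach to a scroll-to-load page looks roughly like the sketch below. The URL, scroll count, and wait times here are placeholder guesses, not working values; on these commentary pages, no variation of this loop reliably captured everything.

```r
# A minimal sketch of the usual RSelenium infinite-scroll approach,
# assuming a Selenium server is already running on port 4445.
library(RSelenium)

remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("http://www.example.com/commentary")  # placeholder URL

# Scroll to the bottom repeatedly so the page loads more content;
# both the repeat count and the pause are guesses.
for (i in 1:20) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);",
                      args = list())
  Sys.sleep(3)
}

page_source <- remDr$getPageSource()[[1]]
remDr$close()
```

The trouble is knowing when the page is actually done loading: a fixed scroll count either wastes time or stops early, which is what pushed me toward the macro's cruder but more forgiving approach.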
Using macros to scrape pages is a trial-and-error process,
but with the following principles, you can drastically reduce the number of trials
it takes to get it right.
Figure 1 - Editing a macro in Asofttech Automation
Figure 2 - Choosing a macro in Asofttech Automation
In the past I have
used Automacrorecorder, but Asofttech Automation offers features like appending to an
existing macro or deleting parts of it after the fact, rather than having to record a
brand-new macro for each change. This could all theoretically be done with AutoHotkey
as well, but if you're not already very familiar with AutoHotkey, then this process is
probably easier with a less powerful, but more user-friendly, tool like
Asofttech Automation.
Prep Principle A: Do whatever you can programmatically
first.
Figure 3 shows a list with two commentary URLs for each match,
one per innings, for an Indian Premier League (IPL) season. This list was
generated through some string manipulation with stringr, but mostly through the
getHTMLLinks() function from the XML package.
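A minimal sketch of that harvesting step is below; the fixtures URL and the 'commentary'/'innings' string patterns are illustrative stand-ins, not the exact ones from this project.

```r
# Sketch of building the URL list; URL and patterns are placeholders.
library(XML)      # provides getHTMLLinks()
library(stringr)

# Pull every link off the season's fixtures page.
links <- getHTMLLinks("http://www.example.com/ipl-fixtures")

# Keep only the commentary links, then make one URL per innings.
match_links <- links[str_detect(links, "commentary")]
urls <- c(str_c(match_links, "?innings=1"),
          str_c(match_links, "?innings=2"))

# One URL per line, ready for the macro to consume from the top.
writeLines(urls, "urls.txt")
```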
Prep Principle B: Plan out your goal and the steps needed
as explicitly as possible.
We need, from each of these pages, to highlight all the text
and save it into a text file so that we can filter it down to the play-by-play
commentary later (a sketch of that filtering step follows below). We do this by…
1. Taking the first URL in the list, cutting it (and thereby
removing it from the list),
2. Pasting it into the navigation bar of the browser and pressing
Enter,
3. Waiting for the page to load, then pressing 'down' or 'page
down' until the end of the page is reached,
4. Selecting all (Ctrl + A or Command + A), and copy/pasting
all of this into a waiting .txt file.
Repeat this process
hundreds of times.
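The later filtering pass, at least, can be done back in R. Here's a minimal sketch, assuming each commentary line begins with an over-and-ball marker like "19.4"; that format and the file names are assumptions for illustration.

```r
# Sketch of filtering the raw dump down to commentary lines only.
# Assumes each ball's commentary starts with an over.ball number
# such as "19.4" (an assumption about the copied text).
library(stringr)

raw <- readLines("raw_dump.txt")
commentary <- raw[str_detect(raw, "^\\d{1,2}\\.\\d\\b")]
writeLines(commentary, "commentary_only.txt")
```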
The main principle is to minimize variation. That means doing
whatever you can, before recording your macro, to ensure that every replay
encounters exactly the same conditions you had when you recorded it.
Principle 1) Close every non-essential window,
especially ones that might create pop-up messages. If the webpages you're scraping have pop-ups or elements that might inconsistently disrupt scraping, use a script blocker.
Principle 2) Write
down the windows that are open, and what arrangement they're in.
As a human, you can see when, say, the 'notepad' window is
on the left half of the screen instead of the right, but a macro is just
blindly repeating previously recorded clicks and key presses.
In my macro, four
windows are open: a notepad with the URLs, a notepad to paste raw text into, a
web browser, and Asofttech Automation. They appear in my taskbar in exactly that order.
Principle 3) Use the keyboard whenever possible.
(Instead of clicking and dragging to highlight text, use Ctrl + A.)
Principle 4) When you must use the mouse, only
interact with things that will always be in the same place and that you can
control.
For example: you can
paste something into the navigation bar, but avoid clicking on links if possible.
Why? Because webpage layouts change often, and that link may be somewhere else later; even a few pixels of difference can mean the link doesn't get clicked.
Principle 5) Don't give your macro opportunities to
'wander'.
Specifically, don't use navigation buttons like 'back'.
If something unusual happens, like a page failing to load or
the expected page not being there (say, due to a cancelled or abandoned match), then
the back button might not bring you back to your previous page, but to the page BEFORE
that.
Likewise, don't open the start menu during your macro,
because later you might end up opening an unexpected program.
Changing the active window with Alt + Tab is also dangerous,
because that shortcut depends on the order in which windows were last active, which
is difficult to keep track of and needs to be the same as it was in the 'home
state' (see Principle 7).
Principle 6) When interacting with things you can't
control, leave wide margins for error.
That means waiting extra time for web pages to load. In the
case of scrolling webpages, it means pressing 'down' or 'page down' about twice
as much as necessary for a typical web page.
Principle 7) Return to the 'home state' at the end of
the macro. That means having the same window active at the end of the recorded
macro as you did at the start. If possible, put the keyboard cursor in the same
place as when you started.
In my macro, each
time I use a URL, I remove it from the list, so that the top line of the URL
list is a new URL each loop. This allows me to set my home state to the very
top-left of the URL notepad file without having to count keypresses. Instead, I
just click anywhere in that notepad and press Ctrl + Home to get to the
beginning.
Finally, have patience. I've been using macros like this for years, and this one still took me seven tries to get right; one of the unsuccessful tries ran for two hours before a fault appeared. Like a lot of automation, it's tedious work until it suddenly isn't.