Glenn Gutmacher joined us at SourceCon Fall 2013 and delivered a well-received presentation about gathering candidate information by scraping the web. Outwit Hub was one of the primary tools he covered. A number of people were impressed with what he shared but needed more clarification. This morning, Glenn and I recorded a quick demo to help you get started.
This is definitely for our more technical readers. A basic understanding of HTML and JavaScript will help you take full advantage of this tool.
The files discussed in the video can be downloaded here.
It was fun (thanks, Jeremy), and I think I made it easier to absorb by providing the step-by-step text instructions as well as the actual Outwit gear file. Definitely click the link above to download the 2 files; then you can easily do everything I showed and more!
Glenn, can you link to the Outwit Hub-specific syntaxes you referenced?
Also, perhaps where to learn about regular expressions (regex)? Or specifically how you knew what you needed to change it to? Is there also an FAQ or something from Outwit?
Randy, when you load the 2 files from Dropbox attached to this
article, you will see a lot of the syntax. A regex syntax example
related to my demo for the “Apply If Page URL Contains” field would be
/pinpoint.microsoft.com\/en-US\/partners\/[\-A-Za-z0-9]*$/ which would
tell Outwit to only run the scraper on pages containing the pattern
between the / and / slashes. Obviously,
pinpoint.microsoft.com/en-US/partners/ is how each partner page URL
starts, but in regex, you have to precede each forward slash in the URL
with a backward slash to indicate that the next character should be
escaped (i.e., treated as is, and not interpreted as a special regex
command character – other special characters like dollar sign, period,
pipe, etc., should also be escaped if you want them to be recognized as
what they are, rather than the command they also mean in regex). The
next portion inside the square brackets is to indicate a pattern of
characters that would follow next in the URL, which can be a hyphen
(note it is preceded by \ to escape it), a capital letter (range A-Z), a
lowercase letter (a-z), or a digit (0-9). Immediately after the right
square bracket is the asterisk, indicating there may be zero or more such
characters, and then the final $ indicates the end of the string (i.e.,
if the URL took on a different pattern after this point, then it would
not be a URL that I want to process). So that hopefully illustrates one
of the more complex parts of one of my scrapers in the attached gear
file. Unfortunately, if you have the free version of Outwit Hub, you
can only share scrapers. To share macros, you would need the paid
version (about $75 for a lifetime license from outwit.com), so you would not be able to use everything in my gear file until you upgrade.
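If you want to experiment with a pattern like this outside of Outwit Hub, here is a rough Python sketch of the same idea. It is an illustration, not Outwit's own syntax: Python does not wrap a regex in / delimiters, so the forward slashes need no backslashes here, and the periods are escaped as an extra precaution. The function name and the sample URLs are made up for the example, and Outwit's regex engine may differ in small ways from Python's.

```python
import re

# Python has no /.../ delimiters, so forward slashes are written as-is.
# The periods are escaped so they match a literal "." rather than any character.
PARTNER_URL = re.compile(r"pinpoint\.microsoft\.com/en-US/partners/[-A-Za-z0-9]*$")


def should_scrape(url: str) -> bool:
    """Return True when the URL contains the partner-page pattern and ends there."""
    return PARTNER_URL.search(url) is not None


# Hypothetical URLs, invented for illustration only.
samples = [
    "http://pinpoint.microsoft.com/en-US/partners/Example-Partner-123",
    "http://pinpoint.microsoft.com/en-US/partners/Example-Partner-123/reviews",
    "http://pinpoint.microsoft.com/en-US/applications/Example-App-456",
]

for url in samples:
    print(should_scrape(url), url)
```

Only the first sample should pass. The second continues past the partner name with another slash, so the final $ rejects it, and the third never contains the partners/ prefix at all.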