One of the many things F# is great for is screen scraping. Here’s why:
Downloading multiple pages asynchronously and in parallel is trivial with F#’s async support
Navigating the HTML DOM is a great fit for higher-order data processing combined with partial application
F# Interactive really shines in iterative processes like this, where you try something out, see that it didn’t quite work, and keep adjusting until you get it right. Doing a full compile-run cycle on each iteration instead of simply evaluating in the REPL would make this task much more time-consuming.
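To make the first point concrete, here is a minimal sketch of the parallel download pattern. The fetchPage function and the example.com URLs are made up for illustration: fetchPage only simulates a download with a short delay, so the snippet runs in F# Interactive without touching the network. In a real script a genuine HTTP call would take its place.

```fsharp
// Hypothetical stand-in for a real HTTP request: waits briefly, then returns fake HTML.
let fetchPage (url: string) : Async<string> =
    async {
        do! Async.Sleep 100
        return sprintf "<html><body>%s</body></html>" url
    }

// Hypothetical URLs, for illustration only.
let urls =
    [ for letter in 'A' .. 'C' -> sprintf "http://example.com/stations?letter=%c" letter ]

// Start all the downloads together and wait for every result.
// Async.Parallel returns the results in the same order as the inputs.
let pages : string[] =
    urls
    |> Seq.map fetchPage
    |> Async.Parallel
    |> Async.RunSynchronously

pages |> Array.iter (printfn "%s")
```

Nothing in the pipeline mentions threads or callbacks: each step is an ordinary function, and switching from one page to many is just a Seq.map followed by Async.Parallel.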
Html Agility Pack is the obvious candidate to use for screen scraping in .NET, but like other LINQ-like libraries that rely heavily on extension methods, its API isn’t ideal for use in F#. A simple wrapper will take care of that problem:
An F# wrapper for HtmlAgilityPack - HtmlAgilityPack.FSharp.fs
When we leave the “target” of an operation as the last parameter of a function, we can leverage partial application to remove the need to create anonymous lambdas when using higher-order functions such as the ones needed for DOM traversal.
In this case, by taking the this parameter of the C# methods and putting it as the last parameter of our wrapper functions elements, descendants, innerText, etc., our code is now much easier to read and to understand than if we used the Html Agility Pack methods directly.
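As a rough illustration of the idea, a wrapper along these lines could look like the sketch below. This is not the actual contents of HtmlAgilityPack.FSharp.fs: the function names are borrowed from the ones mentioned in this post, but their bodies are my assumption of how they might be written. It also assumes the regular HtmlAgilityPack NuGet package loaded through the #r "nuget: ..." directive of recent F# Interactive versions, rather than the PCL build referenced elsewhere in this post.

```fsharp
#r "nuget: HtmlAgilityPack"

open HtmlAgilityPack

// Parse an HTML string and return its root node.
let createDoc (html: string) : HtmlNode =
    let doc = HtmlDocument()
    doc.LoadHtml(html)
    doc.DocumentNode

// In every function below the node comes last, so partial application
// gives us something we can hand straight to Seq.map, Seq.filter or |>.
let descendants (name: string) (node: HtmlNode) = node.Descendants(name)
let attr (name: string) (node: HtmlNode) = node.GetAttributeValue(name, "")
let innerText (node: HtmlNode) = node.InnerText
let hasClass (cls: string) (node: HtmlNode) =
    (attr "class" node).Split(' ') |> Array.contains cls

// Made-up HTML fragment, for illustration only.
let links =
    """<ul class="results"><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>"""
    |> createDoc
    |> descendants "ul"
    |> Seq.filter (hasClass "results")
    |> Seq.collect (descendants "a")
    |> Seq.map (attr "href")
    |> Seq.toList
```

Without the wrapper, the last mapping step would need a lambda such as Seq.map (fun a -> a.GetAttributeValue("href", "")), and every other step in the pipeline would need one too.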
As an example, here is a small script that I used when developing one of my Windows Phone apps to get the list of all the rail stations in Ireland together with their coordinates:
    #r "System.Net"
    #r "../lib/portable/FSharp.Data.dll"
    #r "../packages/HtmlAgilityPack-PCL.1.4.6/lib/HtmlAgilityPack-PCL.dll"
    #load "HtmlAgilityPack.FSharp.fs"

    open System       // Uri
    open FSharp.Data  // Http

    // get a page that lists the stations that start with firstLetter
    let getStationListPage firstLetter =
        "http://www.irishrail.ie/cat_stations_list.jsp?letter=" + (string firstLetter)
        |> Http.AsyncRequestString

    // get all the links to stations inside the <ul class="results">
    let getStations stationListPage =
        stationListPage
        |> createDoc
        |> descendants "ul"
        |> Seq.filter (hasClass "results")
        |> Seq.head
        |> descendants "a"
        |> Seq.map (attr "href")
        |> Seq.toArray

    // get the page for a station
    let getStationPage station =
        "http://www.irishrail.ie/" + station
        |> Http.AsyncRequestString

    // get the latitude and longitude of a station from the google maps link in the station page
    let getCoordinates stationPage =
        let googleMapsLink =
            stationPage
            |> createDoc
            |> descendants "div"
            |> Seq.filter (hasId "map-ordnance")
            |> Seq.head
            |> followingSibling "ul"
            |> descendants "a"
            |> Seq.head
            |> attr "href"
        let split (c:char) (s:string) = s.Split c
        let [| "ll"; coords |] =
            Uri(googleMapsLink).Query
            |> split '&'
            |> Seq.map (split '=')
            |> Seq.filter (Seq.head >> (=) "ll")
            |> Seq.head
        let [| lat; long |] = coords |> split ',' |> Array.map float
        lat, long

    let stationsAndCoords =
        let stations =
            ['A'..'Z']
            |> Seq.map getStationListPage
            |> Async.Parallel
            |> Async.RunSynchronously
            |> Array.collect getStations
        let lat, long =
            stations
            |> Seq.map getStationPage
            |> Async.Parallel
            |> Async.RunSynchronously
            |> Array.map getCoordinates
            |> Array.unzip
        let stations = stations |> Array.map Uri.UnescapeDataString
        Array.zip3 stations lat long
You can also see that the same pattern was used for making the String.Split function play well with partial application.
Another neat feature of F# for scripting (though I wouldn’t recommend it in production code) is the ability to destructure arrays in one-liners, as done in let [| "ll"; coords |] = and let [| lat; long |] =. The compiler will emit a warning saying that the match is not exhaustive, telling us that this will fail at runtime if the array doesn’t have exactly the expected shape (for instance, if it has fewer or more than two elements), but for the purpose of a one-shot script to download some data it’s fine.
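For comparison, here is a sketch of what a more defensive version of the coordinate parsing might look like if it did have to go into production code. The names tryParseFloat and tryParseCoords are hypothetical and not part of the script above; the split helper is the same trick of putting the string last so that it works with the pipeline operator.

```fsharp
open System
open System.Globalization

// The string goes last, so split can be partially applied and piped into.
let split (c: char) (s: string) = s.Split c

// Hypothetical helper: Some value when the text is a valid number, None otherwise.
let tryParseFloat (s: string) =
    match Double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture) with
    | true, value -> Some value
    | _ -> None

// Hypothetical helper: every array shape is handled, so the compiler emits no
// warning and unexpected input yields None instead of a runtime exception.
let tryParseCoords (coords: string) =
    match coords |> split ',' |> Array.map tryParseFloat with
    | [| Some lat; Some long |] -> Some (lat, long)
    | _ -> None

printfn "%A" (tryParseCoords "53.35,-6.25")
printfn "%A" (tryParseCoords "not-a-coordinate")
```

The price is a few more lines and an option that every caller has to unwrap, which is why the one-liner is so tempting in a throwaway script.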
And to give a second example, here’s a snippet from my Learn On The Go app that processes the html of the lecture videos page of a Coursera course: