Extracting Tables from PowerShell's Invoke-WebRequest
If you’ve ever wanted to extract tables from a web page in PowerShell, the Invoke-WebRequest cmdlet is exactly what the doctor ordered.
Once you’ve invoked the cmdlet, the ‘ParsedHtml’ property gives you access to the Internet Explorer DOM of that page. From there, you can get elements by tag name (“TABLE”), ID, and more.
One neat application of this technique is to automatically parse data out of tables on the web page. I recently needed to do this, and the PowerShell script really wasn’t that complicated. In true PowerShell style, each row of the table is output as an object – that way, you can access the data as you would with any other PowerShell cmdlet. Even better - if the table uses the TH tag (“Table Heading”), it uses those headings as property names for the output objects.
Here’s an example of it in action:
1 [C:\Users\leeholm]
>> $url = 'http://www.egyptianhieroglyphs.net/gardiners-sign-list/domestic-and-funerary-furniture/'
2 [C:\Users\leeholm]
>> $r = Invoke-WebRequest $url
3 [C:\Users\leeholm]
>> Get-WebRequestTable.ps1 $r -TableNumber 0 | Format-Table -Auto
P1 P2 P3 P4
-- -- -- --
Gardiner Number Hieroglyph Description of Glyph Details
Q1 Seat Phono. st, ws, . In st ?seat, place,? wsir ?Osiris,? ?tm ?perish.?
Q2 Portable seat Phono. ws. In wsir ?Osiris.?
Q3 Stool Phono. p.
Q4 Headrest Det. in wrs ?headrest.?
Q5 Chest Det. in hn ?box,? ?fdt ?chest.?
Q6 Coffin Det. or Ideo. in qrs ?bury,? krsw ?coffin.?
Q7 Brazier with flame Det. of fire. In ?t ?fire,? s?t ?flame,? srf ?temperature.?
4 [C:\Users\leeholm]
And the script:
param(
[Parameter(Mandatory = $true)]
[Microsoft.PowerShell.Commands.HtmlWebResponseObject] $WebRequest,
[Parameter(Mandatory = $true)]
[int] $TableNumber
)
## Extract the tables out of the web request
$tables = @($WebRequest.ParsedHtml.getElementsByTagName("TABLE"))
$table = $tables[$TableNumber]
$titles = @()
$rows = @($table.Rows)
## Go through all of the rows in the table
foreach($row in $rows)
{
$cells = @($row.Cells)
## If we've found a table header, remember its titles
if($cells[0].tagName -eq "TH")
{
$titles = @($cells | % { ("" + $_.InnerText).Trim() })
continue
}
## If we haven't found any table headers, make up names "P1", "P2", etc.
if(-not $titles)
{
$titles = @(1..($cells.Count + 2) | % { "P$_" })
}
## Now go through the cells in the the row. For each, try to find the
## title that represents that column and create a hashtable mapping those
## titles to content
$resultObject = [Ordered] @{}
for($counter = 0; $counter -lt $cells.Count; $counter++)
{
$title = $titles[$counter]
if(-not $title) { continue }
$resultObject[$title] = ("" + $cells[$counter].InnerText).Trim()
}
## And finally cast that hashtable to a PSCustomObject
[PSCustomObject] $resultObject
}