How to Download an Entire WordPress Blog
Sometimes, you stumble on a blog that is so chock full of information that you revel in its every word. And then you realize their archive goes back 5 years!
I’ve read a bunch of great posts on Nate Lawson’s awesome security blog, and decided that I wanted to read it beginning to end.
If you are the owner of said WordPress blog, the solution is easy – use WordPress’ built-in Export feature. There are even handy services that will turn this into a PDF, eBook, or even printed book.
If you’re not the owner, things aren’t so easy:
- Sit in front of the computer
- Go to the oldest month in the archives menu that you haven’t yet visited
- Read that page
- Click “Next Page” until those links stop appearing
- Go back to the home page
- (Repeat steps 2-5 until you’re done with the blog)
Oh, and if you intend to take a break at any point, add in a few “try to remember where you were, and find that blog post again” steps.
What I was really hoping for was:
- Open a PDF on my Kindle, and read the entire thing in chronological order, letting the Kindle software keep track of where I am.
It turns out that the difference between reality and desire is about twelve lines of PowerShell!
PowerShell’s recent technology previews (and the Windows 8 consumer and developer previews) include the Invoke-WebRequest cmdlet. Think wget / curl, but with PowerShell’s traditional object-based awesome-sauce. For example:
PS C:\temp> Invoke-WebRequest http://www.leeholmes.com/blog |
>> Foreach-Object Links |
>> Where-Object InnerText -match "August" |
>> Foreach-Object Href
http://www.leeholmes.com/blog/2011/08/
http://www.leeholmes.com/blog/2010/08/
http://www.leeholmes.com/blog/2008/08/
http://www.leeholmes.com/blog/2007/08/
http://www.leeholmes.com/blog/2006/08/
http://www.leeholmes.com/blog/2005/08/
When you look at links to the monthly archives, they all follow the pattern:
http://www.example.com/url/<number><number><number><number>/<number><number>/
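That pattern translates directly into a regular expression. Here is a minimal sketch of the idea, using made-up sample links rather than a live page (the URLs and variable names are only for illustration):

```powershell
## Hypothetical sample links; a real run would take these from the Links
## property of an Invoke-WebRequest result
$sampleLinks = @(
    'http://www.example.com/blog/2011/08/',
    'http://www.example.com/blog/about/',
    'http://www.example.com/blog/2005/06/',
    'http://www.example.com/blog/2005/06/page/2/'
)

## Four digits, a slash, two digits, a trailing slash, then end of string
$archivePattern = '/\d{4}/\d{2}/$'

## Keep only the links that look like monthly archives
$archiveLinks = @($sampleLinks | Where-Object { $_ -match $archivePattern })
$archiveLinks
```

Anchoring the pattern with `$` is what keeps the "page/2/" link out of the results, since that URL does not end with the year and month.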
When you visit any of these pages, they have another link. The exact text depends on the blog itself – but it may be “Earlier Entries”, “Next Page”, or similar:
PS C:\temp> $page = Invoke-WebRequest http://www.leeholmes.com/blog/2005/06/
PS C:\temp> $page.Links | Where-Object InnerText -match "Earlier Entries" |
>> Select-Object -First 1
>>
innerHTML : Earlier Entries ?
innerText : Earlier Entries ?
outerHTML : <A href="http://www.leeholmes.com/blog/2005/06/page/2/">Earlier Entries ?</A>
outerText : Earlier Entries ?
tagName : A
href : http://www.leeholmes.com/blog/2005/06/page/2/
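If you want to try the "find the next page" step without hitting a real site, here is a small sketch that uses stand-in objects in place of the real Links collection. The sample data is invented, and only the innerText and href properties are modeled:

```powershell
## Stand-in for $page.Links: objects with just the two properties we need
$sampleLinks = @(
    [pscustomobject] @{ innerText = 'Home'; href = 'http://www.example.com/blog/' },
    [pscustomobject] @{ innerText = 'Earlier Entries'; href = 'http://www.example.com/blog/2005/06/page/2/' },
    [pscustomobject] @{ innerText = 'Earlier Entries'; href = 'http://www.example.com/blog/2005/06/page/2/' }
)

## Take the first matching link, in case the text appears more than once
$nextPage = $sampleLinks |
    Where-Object { $_.innerText -match 'Earlier Entries' } |
    Select-Object -First 1 -ExpandProperty href

## When nothing matches (as on the last page), the result is $null
$noMorePages = $sampleLinks |
    Where-Object { $_.innerText -match 'Older Posts' } |
    Select-Object -First 1 -ExpandProperty href

$nextPage
```

Because an empty result is $null, it can be used directly as the condition of a loop that stops when there are no more pages to follow.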
Given that knowledge, we can automate the download of the entire blog, dumping it into an HTML file as we go. As a final step, we print this HTML to PDF, and upload it to our Kindle or other reading device.
Note to purists: this HTML file is brutally malformed. It is a collection of HTML pages packed into the same file, rather than one HTML page with all the important content. It is of course possible to make this a valid HTML file by manipulating the content before writing it – there’s just no need to do it if the destination is a PDF anyhow.
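For the purists who do want a single valid page, here is one rough way it might be done. This is only a sketch: Get-BodyContent is a hypothetical helper, the two sample pages are invented, and a regular expression is a blunt instrument for HTML that will not cope with every theme.

```powershell
## Hypothetical helper: return only what sits between <body> and </body>
function Get-BodyContent
{
    param([string] $Html)

    ## (?s) lets "." span line breaks; (?i) ignores the case of the tags
    if($Html -match '(?si)<body[^>]*>(.*)</body>')
    {
        $matches[1]
    }
    else
    {
        ## No body tags found, so pass the text through unchanged
        $Html
    }
}

## Two stand-in pages instead of real downloads
$pages = @(
    '<html><head><title>One</title></head><body class="archive"><p>First</p></body></html>',
    '<html><head><title>Two</title></head><body><p>Second</p></body></html>'
)

## Strip each page down to its body, then wrap everything in one document
$bodies = $pages | ForEach-Object { Get-BodyContent $_ }
$combined = "<html><body>`r`n" + ($bodies -join "`r`n") + "`r`n</body></html>"
$combined
```

This drops each page's head section, so any styles defined there would be lost in the combined file.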
And how about the time and effort? In the end, I had a PDF of the entire blog on my Kindle 20 minutes after first having thought of it.
Here’s the PowerShell script that automates this all – cleaned up for your consumption, of course :)
## Things you might want to change
$blogUrl = "http://www.leeholmes.com/blog"
$archiveLinkPattern = '/\d\d\d\d/\d\d/$'
$nextPageText = "Earlier Entries"
## Get the page
$r = Invoke-WebRequest $blogUrl
## Extract the archives links
$links = $r.Links | Where-Object href -match $archiveLinkPattern |
Foreach-Object href
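## Reverse the archive links so that the oldest month comes first.
## Wrapping in @() keeps this working even if only one link was found.
$links = @($links)
[array]::Reverse($links)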
## Go through each archive page
foreach($link in $links)
{
    ## Create a variable to hold the HTML content for this month
    $monthExport = ""

    do
    {
        ## Get the archives for that month
        $month = Invoke-WebRequest $link

        ## Get the page content, and put it at the beginning of the
        ## monthExport variable. That's because "Earlier Entries"
        ## should be placed before the content we just got.
        $monthExport = $month.Content + "`r`n" + $monthExport

        ## Find the link to "Earlier Entries"
        $link = $month.Links | Where-Object innerText -match $nextPageText |
            Foreach-Object href | Select-Object -First 1

    ## Keep on doing this while we found an "Earlier Entries" link
    } while($link)

    ## Now that we're done with the month, put it at the end of the
    ## HTML file (since we're processing months in order)
    $monthExport >> leeholmes.html
}
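The ordering logic in the script (prepend pages within a month, append months to the file) is easy to get backwards, so here is a tiny simulation of it that runs without any network access. The month names and page labels are made up, and arrays stand in for the HTML strings:

```powershell
## Stand-in data: each month lists its pages newest-first, the way the
## archive pages are reached by following "Earlier Entries"
$months = [ordered] @{
    '2005/06' = @('June page 1 (newest)', 'June page 2 (oldest)')
    '2005/07' = @('July page 1 (newest)', 'July page 2 (oldest)')
}

$export = @()
foreach($month in $months.Keys)
{
    $monthExport = @()
    foreach($page in $months[$month])
    {
        ## Prepend, so that older pages end up ahead of newer ones
        $monthExport = @($page) + $monthExport
    }

    ## Append the finished month, since months are visited oldest-first
    $export += $monthExport
}

$export
```

The result lists June's oldest page first and July's newest page last, which is the reading order the export is aiming for.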