Creating a local version of the Pwned Passwords list

Recently, web security chap Troy Hunt released 306 million freely downloadable Pwned passwords and created a website to search them.

[Screenshot: the Have I Been Pwned password search page]

I think this is a very useful resource, and one which appears to me to be very well thought out. But should you send passwords already in use to him? No, and he himself highlights the point that no active passwords should be sent to a 3rd-party website - even the one he created!

Ok, but if I do want to search the list for live passwords - how could I do that? As much as I trust this Troy Hunt fella and respect his work, I don't know him and I don't know what sinister logging he's doing behind his website code (does he have an evil scheme to harvest even more passwords before he holds the world to ransom for 1 meeeelllion dollars!!), and quite rightly my employer wouldn't look too kindly on me testing passwords out by sending them to Troy's site.

How could I do this? Well, he has provided the passwords (all SHA-1 hashed) in text file(s) to download for use elsewhere. Could I create a version of the password checker that my employer and I trust? A local version? Troy initially provided v1.0 of the file (~11.9GB text file with ~306 million passwords in) and shortly afterwards released 2 smaller update files. How could these be used to provide a local copy of the Pwned Passwords search utility?

I have some PHP & website coding knowledge and knew that it would be easy to create a front-end website similar to Troy's, but accessing & searching the (huge) dataset was going to require a bit of thought. I don't know all that much about designing databases for querying a large number of records, and using Azure Table storage (and the like) would be a learning curve. To use an on-prem database I'd have to involve the DBAs and ask them to get the data into a database of some description. I knew that going down that route would lead to a bit of red tape, and I just wanted to quickly produce a proof-of-concept for the InfoSec guys.

I wondered what searching the flat files would look like. Having PHP's strpos function search a few large files felt wrong - very wrong. I didn't test it, but I figured that strpos would take longer than the 30 second execution timeout to search an 11.9GB file for a specific string. What could I do?

Then I remembered Troy did a blog post about Azure Table storage and how he stores the records for the ever-growing HaveIBeenPwned site, and I figured that the password file has well-structured data which could follow a similar pattern to the Azure partitions. I knew that all the passwords were SHA-1 hashed, so each line in the file is a hexadecimal string of 40 characters, and the lines appeared to be ordered alphabetically. What if I split the source data up based on the first few digits of the SHA-1 hash? I could produce smaller files named after their starting prefix, e.g. 0f.txt would contain all the password hashes starting 0F. A PHP website could easily identify which txt file to check based on the request and then search through a much smaller file for the password hash. That'd be quick - right?
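To make the naming idea concrete, here is a minimal sketch in Python (not the PHP the site actually uses). The function name `partition_filename` and the `prefix_length` parameter are made up for this illustration; the only things taken from the post are the 40-character hex format and the lowercase `0f.txt` style of file name.

```python
import re


def partition_filename(sha1_hex, prefix_length=2):
    """Return the partition file name for a SHA-1 hash, e.g. '0f.txt'.

    prefix_length is a hypothetical knob: how many leading hex digits
    of the hash are used as the partition key.
    """
    candidate = sha1_hex.strip().lower()
    # A SHA-1 hash is always 40 hexadecimal characters.
    if not re.fullmatch(r"[0-9a-f]{40}", candidate):
        raise ValueError("expected a 40-character hexadecimal SHA-1 hash")
    return candidate[:prefix_length] + ".txt"
```

Given a hash starting 0F, this returns `0f.txt`, so a lookup only ever has to open one small file.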

I'm a Windows sysadmin by day, so PowerShell is my day-to-day scripting language, and I wrote a PowerShell script to run through the files and create the 'partition' files (this is what I've termed the resulting smaller files). ~~I've put the code up on GitHub's GIST should you wish to use it~~ Thanks to 'meilon', there is an updated and much faster version of the script available here.

I ran some tests of the script against the small pwned-passwords-update-2.txt file. At first I decided on the first 2 digits (to produce 256 separate files named 00.txt to ff.txt), but the resultant files were likely to be ~50MB each once all the data was parsed. This still felt too large to me, so I split the data up based on the first 3 digits of the hash instead, which produced 4096 separate files named 000.txt to fff.txt. I tested against the main pwned-passwords-1.0.txt file and all were around 3MB in size, containing ~75,000 SHA-1 hashes. This felt like a better file size, with a manageable number of lines in each one to search. I left the script to run over the weekend to create the initial partition files from the main pwned-passwords-1.0.txt file.
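The file-size figures can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below assumes roughly 306 million hashes (the figure quoted for the v1.0 file) and 42 bytes per line, i.e. 40 hex characters plus a Windows-style line ending; the line-ending size is an assumption, not something measured from the real file. The function name is invented for this example.

```python
HASH_COUNT = 306_000_000  # approximate number of hashes in the v1.0 file
BYTES_PER_LINE = 42       # assumption: 40 hex characters + CR + LF


def estimate_partitions(prefix_length):
    """Estimate (file count, hashes per file, MB per file) for a prefix length."""
    file_count = 16 ** prefix_length          # 16 possible values per hex digit
    hashes_per_file = HASH_COUNT / file_count  # assumes hashes spread evenly
    megabytes_per_file = hashes_per_file * BYTES_PER_LINE / 1_000_000
    return file_count, hashes_per_file, megabytes_per_file


for digits in (2, 3):
    count, hashes, size_mb = estimate_partitions(digits)
    print(f"{digits} digits: {count} files, ~{hashes:,.0f} hashes, ~{size_mb:.1f} MB each")
```

Under those assumptions, 2 digits works out at about 50MB per file and 3 digits at about 3MB per file with roughly 75,000 hashes each, which lines up with what the test runs showed.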

[Screenshot: PowerShell processing the v1.0 file. Caption: "Please come back later....much later!"]

Following the first run, I ran the script again, passing in the update-1 and update-2 filenames so the content of those files got added to the existing partition files. I'm not bothered that the data within each partition file is no longer in alphabetical order, just that a given partition file contains all the SHA-1 hashes starting with the partition name. Going forward, I can run the script again to add further updates as/when Troy releases them.
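For readers who don't speak PowerShell, the split-and-append behaviour can be sketched in Python. This is not the script linked above, just an illustration of the same idea under a few assumptions: one hash per line in the source file, a 3-digit prefix, and writes batched in memory so that 4096 files don't all need to be open at once. The function name, parameters and batch size are all invented for this example.

```python
from collections import defaultdict
from pathlib import Path


def split_into_partitions(source_path, output_dir, prefix_length=3, batch_size=100_000):
    """Append every hash in source_path to <prefix>.txt inside output_dir.

    Files are opened in append mode, so running this again with an update
    file adds to the existing partitions. Returns the number of hashes written.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    buffered = defaultdict(list)  # prefix -> hashes waiting to be written
    pending = 0
    total = 0

    def flush():
        # Write each prefix's buffered hashes to its partition file.
        for prefix, hashes in buffered.items():
            with open(out / f"{prefix}.txt", "a", encoding="ascii") as handle:
                handle.write("\n".join(hashes) + "\n")
        buffered.clear()

    with open(source_path, "r", encoding="ascii") as source:
        for line in source:
            sha1_hex = line.strip().upper()
            if not sha1_hex:
                continue  # skip blank lines
            buffered[sha1_hex[:prefix_length].lower()].append(sha1_hex)
            pending += 1
            total += 1
            if pending >= batch_size:
                flush()
                pending = 0
    flush()  # write whatever is left in the buffer
    return total
```

Because nothing is re-sorted, the hashes from an update simply land at the end of each partition file, which is the same trade-off described above.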

For reference the time the PowerShell script took on my PC (Intel Core i5-3470 (quad-core) with 8GB RAM, running Windows 7 & PowerShell 4.0) was as follows:-

File	Time to process (original script)	Time to process (updated script)
pwned-passwords-1.0.txt	~3d 8h 30m	20m 39s
pwned-passwords-update-1.txt	~3h 45m	2m 01s
pwned-passwords-update-2.txt	7m 22s	0m 03s

However, looking at Task Manager, only 1 core was being used. It's a fairly simple script with no multi-threading fanciness in it. I'm not too bothered, and I'm not going to spend time optimizing it further as I'm not likely to be running it again against a massive file. Additionally, as the 547MB update-1 file takes under 4h to process, I can tolerate that processing time every once in a while. After all three source files had been processed, each partition file came to around 3.1MB in size and contained ~78,000 passwords.

[Screenshot: the partition files]

What about the website? I already had a web server configured for other web-tools our team use so I added to it. The existing site is written in PHP using the CodeIgniter framework served via IIS on a 1 CPU, 2GB RAM Windows Server 2008 R2 virtual machine (I'm a Windows sysadmin, who knows the .NET based PowerShell scripting language, but is happier with PHP than .NET for websites - you know what you know I guess :/ ). It took ~ 1/2 day to put together the core functionality of the site, followed by another ~1/2 day 'tweaking' the layout.

Entries in the password input box are SHA-1 hashed client-side before being submitted, to mitigate any possible logging of the plain-text password in the request. Based on a few unscientific tests, searching for a password is quick enough for me (requests return within a second) to think that it'll be suitable for the anticipated load it will get from the staff here at work.
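The hash-then-lookup flow can be sketched as follows. On the real site the hashing happens in the browser and the lookup in PHP; this Python version puts both halves side by side purely for illustration. The function names, the 3-digit prefix default and the UTF-8 encoding of the password are assumptions of this sketch. The scan is a plain line-by-line comparison because the partition files are not kept sorted once updates have been appended.

```python
import hashlib
from pathlib import Path


def sha1_hex(password):
    """Return the uppercase SHA-1 hex digest of a password (UTF-8 assumed)."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def hash_in_partitions(sha1_hash, partition_dir, prefix_length=3):
    """Return True if sha1_hash appears in its partition file."""
    target = sha1_hash.strip().upper()
    partition = Path(partition_dir) / f"{target[:prefix_length].lower()}.txt"
    if not partition.is_file():
        return False  # no partition for this prefix, so no match
    with open(partition, "r", encoding="ascii") as handle:
        # Linear scan: the file only holds hashes sharing this prefix.
        return any(line.strip().upper() == target for line in handle)
```

A call such as `hash_in_partitions(sha1_hex("password"), "partitions")` (with `partitions` being a hypothetical folder of partition files) only ever reads one small file, which is why responses can come back quickly.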

Does any of the wording I've used on the website look familiar to anyone? ;)

[Screenshot: the local password checker website]

I'm hopeful it can be used for colleague awareness to improve password quality and security, but at the very least I can use it as an extra check for passwords being created, as per the NIST guidance on page 14 of this publication:-

[Image: excerpt from the NIST guidance]

Finally, I'd like to know what you think about my approach. Is this something you're doing yourself? How are you making use of the Pwned Passwords list? Leave a message in the comments below.

Update 17/Aug/2017

Thanks to meilon, who left a comment and updated the PowerShell script to make it much faster. I've updated this blog post to reference his updated script and to include its timings on my slow office PC.