The title of this article probably sounds like the caption to a meme. Instead, this is an actual problem GitGuardian’s engineers had to solve in implementing the mechanisms for their new HasMySecretLeaked service. They wanted to help developers find out if their secrets (passwords, API keys, private keys, cryptographic certificates, etc.) had found their way into public GitHub repositories. How could they comb a vast library of secrets found in publicly available GitHub repositories and their histories and compare them to your secrets without you having to expose sensitive information? This article will tell you how.
First, if we were to set a bit’s mass as equal to that of one electron, a ton of data would be around 121.9 quadrillion petabytes at standard Earth gravity, or 39.2 billion billion billion US dollars in MacBook Pro storage upgrades (more than all the money in the world). So when this article claims GitGuardian scanned a “ton” of GitHub public commit data, that’s figurative, not literal.
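For the curious, the petabyte figure can be roughly reproduced in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes a metric ton (1,000 kg) and binary petabytes (2^50 bytes each), neither of which the article states, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the "ton of data" figure.
# Assumptions (not stated in the article): metric ton, binary petabytes.
ELECTRON_MASS_KG = 9.1093837e-31   # rest mass of one electron
TON_KG = 1000.0                    # assumed: metric ton
BYTES_PER_PETABYTE = 2 ** 50       # assumed: binary petabyte (pebibyte)

bits = TON_KG / ELECTRON_MASS_KG   # one bit "weighs" one electron
data_bytes = bits / 8
petabytes = data_bytes / BYTES_PER_PETABYTE

print(f"{petabytes / 1e15:.1f} quadrillion petabytes")
```

Under those assumptions the result lands on the same order as the number quoted above; a different definition of “ton” or “petabyte” shifts it by a few percent.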
But yes, they scanned a “ton” of public commits and gists from GitHub, traversing commit histories, and found millions of secrets: passwords, API keys, private keys, cryptographic certificates, and more. And no, “millions” is not figurative. They literally found over 10 million in 2022.
How could GitGuardian make it possible for developers and their employers to see if their current and valid secrets were among that 10+ million without simply publishing millions of secrets, making it easier for threat actors to find and harvest them, and letting a lot of genies out of a lot of bottles? One word: fingerprinting.
After some careful evaluation and testing, they developed a secret-fingerprinting protocol that encrypts and hashes the secret and then shares only a partial hash with GitGuardian. That partial hash narrows the potential matches to a manageable number without giving GitGuardian enough of the hash to reverse and decrypt it. To further ensure security, they put the toolkit for encrypting and hashing the secret on the client side.
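The general idea of sharing only part of a hash can be sketched in Python. This is a minimal illustration and not GitGuardian’s actual protocol: it uses plain SHA-256 as a stand-in for their hashing, skips the encryption step entirely, and assumes the final comparison happens on the client. The prefix length and every function name here (`fingerprint`, `server_lookup`, `has_leaked`) are hypothetical.

```python
import hashlib

PREFIX_LEN = 5  # hypothetical: how many hex characters leave the machine


def fingerprint(secret: str) -> str:
    """Hash the secret locally. SHA-256 is a stand-in, not the real scheme."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


def server_lookup(prefix: str, leaked_digests: set[str]) -> list[str]:
    """Stand-in for the remote service: it sees only the prefix and
    returns every known digest that starts with it."""
    return sorted(d for d in leaked_digests if d.startswith(prefix))


def has_leaked(secret: str, leaked_digests: set[str]) -> bool:
    """Send only the prefix, then compare full digests on the client side."""
    digest = fingerprint(secret)
    candidates = server_lookup(digest[:PREFIX_LEN], leaked_digests)
    return digest in candidates


# Simulated database of fingerprints for secrets already found in public.
leaked = {fingerprint(s) for s in ("hunter2", "example-api-key-123")}

print(has_leaked("hunter2", leaked))
print(has_leaked("a-secret-that-never-leaked", leaked))
```

In this sketch the simulated service never receives the secret or its full digest, only a short prefix that many different secrets could share. A longer prefix means fewer candidates to return but tells the service more about what is being checked, which is the trade-off behind sharing only a partial hash.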
If you’re using the HasMySecretLeaked web interface, you can copy a Python script to create the hash locally and paste just the output into the browser. You never have to put the secret itself anywhere it can be transmitted by the browser, and you can easily review the 21 lines of code to prove to yourself that it’s not sending anything outside the terminal session you opened to run the script. If that’s not enough, open the F12 developer tools in Chrome or another browser and go to the “Network” panel to monitor what information the web interface is sending upstream.
If you’re using the open-source ggshield CLI, you can inspect the CLI’s code to see what is happening when you use the hmsl command. Want even more assurance? Use a traffic inspector like Fiddler or Wireshark to view what’s being transmitted.
GitGuardian’s engineers knew that even customers who trusted them would be apprehensive about pasting an API key or some other secret into a box on a web page. For both security and the peace of mind of everyone who uses the service, they chose to be as transparent as possible and put as much of the process under customer control as possible. This goes beyond their marketing materials and into the ggshield documentation for the hmsl command.
GitGuardian went the extra mile to make sure that people using their HasMySecretLeaked checker don’t have to share the actual secrets to see if they leaked. And it’s paid off. Over 9,000 secrets were checked in the first few weeks it was live.
If your secrets have already been publicly divulged, it’s better to know than not. They may not have been exploited yet, but it’s likely just a matter of time. You can check up to five per day for free with the HasMySecretLeaked web checker, and even more using the ggshield CLI. And even if you’re not looking to see if your secrets leaked, you should look at their code and methods for inspiration on how to let your own customers verify sensitive information without sharing the information itself.