PGP encryption using python in Azure databricks

Anupam Chand
5 min read · Aug 16, 2022

Many data analytics platforms ingest files on the order of terabytes. Some of these files contain sensitive data, and the organization may insist on some type of encryption.
PGP (Pretty Good Privacy) relies on asymmetric encryption, commonly RSA, in which the data is encrypted with one key (the public key) and decrypted with another (the private key). Together the two form a public/private key pair.

Creation of key pair

The first step is to create a private and public key pair. This is a one-time activity that has to be done by the receiver. A separate key pair must be generated for each sender/receiver combination.

We can generate 2048-, 3072- or 4096-bit public and private keys with the PowerShell code below, which uses the PSPGP module and can be run locally in Visual Studio Code. 2048 bits is considered secure right now. You may feel tempted to go for the largest key size, but keep in mind that the larger the key, the more secure the encryption and also the more compute and time required for encryption and decryption.

# https://evotec.xyz/encrypting-and-decrypting-pgp-using-powershell/
# https://github.com/EvotecIT/PSPGP

$basepath = "C:\Users\Anupam.Chand\OneDrive - Shell\Documents\Work\Architecture stuff\Azure\SQL\Powershell\PSPGP"
$global:pubpath = $basepath + "\PublicPGP1.asc"
$global:pripath = $basepath + "\PrivatePGP1.asc"
$global:strength = 4096 # Can be 2048, 3072 or 4096
$global:blank = ""

function new_key() {
    # Generate the key pair and write both keys to disk
    New-PGPKey -Strength $global:strength -Certainty $global:blank -FilePathPublic $global:pubpath -FilePathPrivate $global:pripath
}

function convert_b64() {
    # Read the private key as raw bytes (works in Windows PowerShell and PowerShell 7)
    $bytes = [System.IO.File]::ReadAllBytes($global:pripath)
    $base64 = [System.Convert]::ToBase64String($bytes)
    Write-Host $base64
}

new_key
convert_b64

The above example does not use a user ID or passphrase, but these can be added if needed. Check the GitHub repository for examples.

The above code will output:

  1. Private key as a file
  2. Public key as a file
  3. Base64 encoded Private key to console

Load the base64-encoded private key into Azure Key Vault as a secret. Let’s call it ‘privatekb64’. The decryption Databricks notebook should have access to this secret.
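For readers who would rather do the base64 step in Python than in PowerShell, it can be sketched roughly as below. This is a minimal sketch, not part of the original workflow: the function name `private_key_to_base64` is hypothetical, and uploading the resulting string to Key Vault is not shown.

```python
import base64
from pathlib import Path


def private_key_to_base64(key_path):
    """Return the base64 text of a private key file (hypothetical helper).

    The file is read as raw bytes so the armored key is preserved exactly.
    """
    raw = Path(key_path).read_bytes()
    return base64.b64encode(raw).decode("ascii")
```

Calling `private_key_to_base64` on the private key file written by New-PGPKey should give a single-line string that can be pasted into the secret value.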

Encryption

Share the public key .asc file with the Source system. They will use this to encrypt the file before sending. Usually, the encrypted file should have a .pgp suffix. Example.csv will become Example.csv.pgp. This is so the decryptor will know what the extension of the unencrypted file is. This agreement should be made with the sender.
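The naming agreement can be enforced on the receiving side with a small helper. The sketch below is illustrative only; `decrypted_name` is a hypothetical function, and it assumes the sender always appends exactly `.pgp` to the original file name.

```python
from pathlib import Path


def decrypted_name(encrypted_path, suffix=".pgp"):
    """Derive the unencrypted file path by dropping the agreed suffix."""
    path = Path(encrypted_path)
    if path.suffix.lower() != suffix:
        # Fail early if the sender did not follow the naming agreement
        raise ValueError(f"{path.name} does not end with {suffix}")
    # with_suffix("") removes only the last suffix: Example.csv.pgp -> Example.csv
    return path.with_suffix("")
```

Rejecting files without the agreed suffix makes a broken naming agreement visible before any decryption is attempted.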

To simulate the encryption, we can use an online tool such as https://pgptool.org/#

Decryption

We will use the code below to do the decryption in our Databricks notebook using Python. It takes the encrypted file from above, decrypts it and writes the resulting dataset to ‘testdecrypt.csv’.

# Decrypting a file using the private key
import base64
from timeit import default_timer as timer

import pgpy

def get_private_key():
    # dbutils is provided by the Databricks notebook runtime
    pk_base64 = dbutils.secrets.get(scope="PGP", key="privatekb64")
    # The secret holds the base64-encoded, ASCII-armored private key
    return base64.b64decode(pk_base64).decode("ascii")

KEY_PRIV = get_private_key().lstrip()

# from_blob returns a (key, other) tuple; only the key is needed here
priv_key, _ = pgpy.PGPKey.from_blob(KEY_PRIV)

# PGP decryption start
t0 = timer()
message_from_file = pgpy.PGPMessage.from_file('/dbfs/mnt/anuadlstest/test.csv.pgp')
raw_message = priv_key.decrypt(message_from_file).message
# .message can be text or bytes depending on how the file was encrypted
if isinstance(raw_message, str):
    raw_message = raw_message.encode("utf-8")
with open('/dbfs/mnt/anuadlstest/testdecrypt.csv', "wb") as csv_file:
    csv_file.write(raw_message)
print("Decryption Complete :" + str(timer() - t0))

Source and target file comparison

We will use the demo code below to compare the unencrypted source file with the decrypted file and prove that our decryption works correctly with no corruption. This should not be put in the actual production notebook, as you will not have the original source file to compare with.

# This compares the original file with the decrypted file.

import hashlib

def hashfile(file):
    # An arbitrary (but fixed) buffer size: 65536 bytes = 64 kilobytes
    BUF_SIZE = 65536
    sha256 = hashlib.sha256()
    with open(file, 'rb') as f:
        while True:
            # Read the next chunk of up to BUF_SIZE bytes
            data = f.read(BUF_SIZE)
            # An empty result means end of file
            if not data:
                break
            # Feed the chunk into the running sha256 hash
            sha256.update(data)
    # Return the hash of everything read, in hexadecimal format
    return sha256.hexdigest()

# Hash both files and keep the results
Initial_hash = hashfile('/dbfs/mnt/anuadlstest/test.csv')
Decrypted_hash = hashfile('/dbfs/mnt/anuadlstest/testdecrypt.csv')

# Simple string comparison to check whether the two hashes match
if Initial_hash == Decrypted_hash:
    print("Both files are same")
    print(f"Hash: {Initial_hash}")
else:
    print("Files are different!")
    print(f"Hash of File 1: {Initial_hash}")
    print(f"Hash of File 2: {Decrypted_hash}")

If the above code produces a result “Both files are same”, this means that our decryption is working correctly.

Below is another example of key generation using a username and passphrase.

# https://evotec.xyz/encrypting-and-decrypting-pgp-using-powershell/
# https://github.com/EvotecIT/PSPGP
$basepath = "C:\Users\Anupam.Chand\OneDrive - Shell\Documents\Work\Architecture stuff\Azure\SQL\Powershell\PSPGP"
$pubpath = $basepath + "\PublicPGP.asc"
$pripath = $basepath + "\PrivatePGP.asc"
$username = "anutestusername"
$password = "testpassword"
$strength = 2048 # 2048 3072 4096 in bits
$blank = ""

function new_key() {
    # Generate a private and public key pair for specified strength
    New-PGPKey -Strength $strength -Certainty $blank -FilePathPublic $pubpath -FilePathPrivate $pripath -UserName $username -Password $password
}

function encrypt_folder() {
    Protect-PGP -FilePathPublic $pubpath -FolderPath $basepath\Test -OutputFolderPath $basepath\Encoded
}

function decrypt_folder() {
    Unprotect-PGP -FilePathPrivate $pripath -Password $password -FolderPath $basepath\Encoded -OutputFolderPath $basepath\Decoded
}

function encrypt_string($string) {
    $ProtectedString = Protect-PGP -FilePathPublic $pubpath -String $string
    $ProtectedString
}

function decrypt_string($ProtectedString) {
    $Decrypted_string = Unprotect-PGP -FilePathPrivate $pripath -Password $password -String $ProtectedString
    $Decrypted_string
}

function convert_b64() {
    # Convert Private key to base64 encoded format to store in Key vault
    # ReadAllBytes works in both Windows PowerShell and PowerShell 7
    $bytes = [System.IO.File]::ReadAllBytes($pripath)
    $base64 = [System.Convert]::ToBase64String($bytes)
    $base64
}

new_key
#encrypt_string("This is a secret")
$string = @"
xxxxxxxx
"@
#decrypt_string($string)
#encrypt_folder
#decrypt_folder
convert_b64

Decrypting large files

Using PGPy with large files (1 GB or more) can lead to memory issues in Databricks, as there are no streaming decryption options available. The solution is to use gpg in a bash script in a Databricks notebook, which can be called from ADF.
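If both approaches live in the same pipeline, the choice between them could be made from the file size. The sketch below is only an illustration: `pick_decryptor` and the `ONE_GB` cut-off are hypothetical, and the right threshold will depend on the memory available on your cluster.

```python
import os

ONE_GB = 1024 ** 3  # assumed cut-off, taken from the rough 1 GB figure above


def pick_decryptor(encrypted_path, threshold_bytes=ONE_GB):
    """Return "gpg" for files at or above the threshold, otherwise "pgpy"."""
    size = os.path.getsize(encrypted_path)
    return "gpg" if size >= threshold_bytes else "pgpy"
```

Note that this looks at the encrypted file, which may be smaller than the decrypted data if the sender compressed it, so a lower threshold may be safer.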
Python cell

%python
import os
mykey = <<< convert the base64 version of the private key into ASCII format as shown above >>>
os.environ['mykey'] = mykey
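One possible way to fill in the placeholder in the Python cell is sketched below. It assumes the secret is stored base64-encoded as described earlier; `export_key_to_env` is a hypothetical helper name, not something provided by Databricks.

```python
import base64
import os


def export_key_to_env(pk_base64, env_name="mykey"):
    """Decode the base64 secret to armored ASCII text and expose it as an env var."""
    key_text = base64.b64decode(pk_base64).decode("ascii")
    os.environ[env_name] = key_text
    return key_text
```

In a notebook this would be called with the secret value, for example `export_key_to_env(dbutils.secrets.get(scope="PGP", key="privatekb64"))`, reusing the scope and key names from the decryption section.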

Bash script when the private key was created without a passphrase

%sh
# gpg --import expects a file name or stdin, so pipe the key text in
echo "$mykey" | gpg --no-tty --batch --import
gpg --no-tty --batch --yes --ignore-mdc-error --pinentry-mode=loopback --output /dbfs/mnt/testblob/output_file.csv --decrypt /dbfs/mnt/testblob/input_file.csv.pgp

Bash script when the private key was created with a passphrase

%sh
# gpg --import expects a file name or stdin, so pipe the key text in
echo "$mykey" | gpg --no-tty --batch --import
gpg --no-tty --batch --yes --passphrase "<<your_passphrase>>" --ignore-mdc-error --pinentry-mode=loopback --output /dbfs/mnt/testblob/output_file.csv --decrypt /dbfs/mnt/testblob/input_file.csv.pgp

All private keys and passphrases should be stored in Key vault only.
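To keep the passphrase out of the notebook source and off the gpg command line, one option is to call gpg from Python and pass the passphrase on standard input. The following is a rough sketch under that assumption; `build_gpg_decrypt_command` and `decrypt_with_gpg` are hypothetical helper names, and the key is assumed to have been imported already.

```python
import subprocess


def build_gpg_decrypt_command(input_path, output_path):
    """Build the gpg argument list; the passphrase is deliberately not part of it."""
    return [
        "gpg", "--no-tty", "--batch", "--yes",
        "--ignore-mdc-error",
        "--pinentry-mode", "loopback",
        "--passphrase-fd", "0",  # read the passphrase from stdin
        "--output", output_path,
        "--decrypt", input_path,
    ]


def decrypt_with_gpg(input_path, output_path, passphrase):
    """Run gpg, sending the passphrase as the first line of stdin."""
    cmd = build_gpg_decrypt_command(input_path, output_path)
    # check=True raises CalledProcessError if gpg exits with an error
    subprocess.run(cmd, input=passphrase + "\n", text=True, check=True)
```

The passphrase itself would come from a Key Vault-backed secret via `dbutils.secrets.get`, under whatever scope and key name you have chosen for it.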
