A recent issue came up where a customer was trying to use the Python library twobitreader in a UDF to pull out some genetic information for individual genes. Think of it like looking up a range of characters in a file and outputting them as a string. The problem they ran into is that it is shockingly difficult to read from a single file in a UDF, because the file has to be available on all the nodes.
We do have documentation outlining how to use the mssparkutils mount API, but (in my opinion) it misses some of the important steps needed to make this easy to use, including not making it obvious that the folder you "mount" isn't actually the folder name you will use to access the file system.
I have an example Spark notebook on GitHub that outlines using the mount API to read directly from a file, but let me give you the important bit:
Mounting the filesystem
The first step is to mount the file system as a folder using mssparkutils.fs. You can use a linked service so you don't have to share credentials.
from notebookutils import mssparkutils  # pre-imported in Synapse notebooks, shown here for completeness

mssparkutils.fs.mount(
    "abfss://cabattag@cabattagsyn.dfs.core.windows.net/",  # Your ADLS Gen2 account in container@account.fqdn format
    "/udffolder",  # The folder where you want the container mounted (with caveats shown below)
    {"linkedService": "cabattag-synapse-WorkspaceDefaultStorage"}  # The linked service that can access the ADLS account
)
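Before moving on, it can be worth confirming the mount actually took. The snippet below is a quick sanity check using mssparkutils.fs.mounts(), which lists the session's active mount points, with mssparkutils.fs.unmount() for cleanup when you're done; treat it as a sketch against the current Synapse API rather than part of the original notebook.

# List the active mount points and confirm /udffolder is among them
for mount in mssparkutils.fs.mounts():
    print(mount)

# When you are finished with the mount, you can release it
# mssparkutils.fs.unmount("/udffolder")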
Accessing the filesystem
This is the part I feel isn't well documented. The path is not the one you mounted above; it is /synfs/, followed by the Spark job ID from mssparkutils.env.getJobId(), followed by your path.
jobId = mssparkutils.env.getJobId()
# Creating our file path in the format /synfs/{jobId}/{mountFolder}/{fileName}
filepath = f"/synfs/{jobId}/udffolder/UDFTest/udfread.txt"
from pyspark.sql.functions import col, lit

# Use the UDF from above to look up the characters in the file and add them as a column
df.withColumn("foundstring", getStringFromFile(col("beginning"), col("end"), lit(filepath))) \
    .show(truncate=False)
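This works because the mount point is available on every node in the Spark pool, so the file open inside the UDF resolves on whichever executor runs the task; a file that only existed on the driver would fail here, which is exactly the problem we set out to avoid.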