How to stream Microsoft SQL Server to S3 using BCP on Linux

Daniel
4 min read · Feb 5, 2020

BCP is great because it is fast and extremely reliable.

But it has some serious limitations:

  • it cannot escape delimiters or use quoting to handle fields that contain delimiters (so if your data has tabs, you can’t make a TSV)
  • it cannot pipe to stdout (so no compression and no streaming to s3 out of the box)

Thankfully, however, BCP runs on Linux. So, despite these limitations, with a little bash trickery we can use BCP to do the following:

  • stream a very large table to s3
  • split every 10,000,000 rows
  • use gzip compression
  • use the standard delimiters \t and \n, with \ as the escape character.

I’ll spare you the journey of discovery and just give you the goods. Then I’ll explain the choices afterwards.

The solution

#!/bin/bash
a=$'\001' # SOH
b=$'\002' # STX
c=$'\003' # ETX
ft="$a$b$b$a"
rt="$a$c$c$a"
awk_command_elements=(
'BEGIN {RS=rt;FS=ft;OFS="\t";ORS="\0"}'
'{gsub(/\0/,"");'
'gsub(/\\/,"\\\\");'
'gsub(/\n/,"\\\n");'
'gsub(/\t/,"\\\t");'
'$1=$1}1'
)
full_awk_command=$(IFS=; echo "${awk_command_elements[*]}") # join the elements with no separator
tr_replace_nul='tr "\0" "\n"' # replace NUL with newline
s3_bucket='my-bucket'
s3_key='my-table-$FILE.csv.gz' # $FILE will be used in split command
tfifo=$(mktemp -d)/fifo
mkfifo $tfifo
bcp "
SELECT *
FROM db01.dbo.table01 s (nolock)
" \
queryout $tfifo \
-S server01 \
-U my_user \
-P "$MY_PASS" \
-a 65535 \
-c \
-r $rt \
-t $ft &
pid=$! # get PID of backgrounded bcp process
count=$(ps -p $pid -o pid= |wc -l) # check whether process is still running
if [[ $count -eq 0 ]] # if process is already terminated, something went wrong
then
echo "something went wrong with bcp command"
rm $tfifo
wait $pid
exit $?
else
echo "bcp command still running"
cat $tfifo | \
awk \
-v rt=$rt \
-v ft=$ft \
"$full_awk_command" | \
split \
-l 10000000 \
-t "\0" \
--filter="${tr_replace_nul} | gzip | aws s3 cp - s3://${s3_bucket}/${s3_key}" \
&& rm $tfifo
exit $?
fi

Is this ugly? Absolutely. If support for escaping delimiters and piping to stdout is added to bcp, then such a hack will not be needed. In the meantime, this will do the job.

Discussion below.

Streaming to s3

AWS’s command line utility can copy to s3 from a pipe: with - as the source, aws s3 cp reads from standard input. So e.g. you can do echo 'abc' | aws s3 cp - s3://my-bucket/my-file.txt.

But BCP cannot pipe to stdout. The hack to get around this is to use a fifo file.

  • create a fifo file and store the filename in variable $tfifo
  • start writing to the fifo file with bcp … queryout $tfifo …, and background the process
  • then read from the fifo and pipe into aws s3 cp - s3://... to stream bcp’s output to s3 (a minimal sketch follows this list).
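Here is a minimal, standalone sketch of the fifo pattern. The writer, bucket, and key are placeholders of mine; in the real script the writer is the backgrounded bcp command:

tfifo=$(mktemp -d)/fifo
mkfifo "$tfifo"
seq 1 100 > "$tfifo" &   # stand-in writer; in reality: bcp ... queryout "$tfifo" ... &
cat "$tfifo" | aws s3 cp - s3://my-bucket/my-file.txt   # reads the stream as the writer produces it
rm "$tfifo"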

Delimiters

BCP cannot escape delimiters, so you need to use delimiters that will never appear in your data. You might have luck with the control characters 0x01 and 0x02. But in my experience, if you have enough data, you will eventually encounter them. What I have had luck with instead are the multi-character delimiters defined in this block:

a=$'\001'  # a.k.a. 0x01 / SOH / ^A
b=$'\002' # a.k.a. 0x02 / STX / ^B
c=$'\003' # a.k.a. 0x03 / ETX / ^C
ft="$a$b$b$a" # field terminator
rt="$a$c$c$a" # row terminator

There is nothing special about these delimiters apart from the fact they likely won’t be found in the wild.

Now, you could just leave your data in s3 with these multi-character delimiters. But there are a few problems with this.

For one, you cannot use split with these delimiters. The split utility is what lets us split files every N rows, and recent versions of GNU split can use any single character as the record separator. However, it does not support multi-character separators.
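As a quick, standalone illustration of the single-character separator (the data and filename prefix here are made up):

printf 'one;two;three;four;five;' | split -l 2 -t ';' - part_
# part_aa holds one;two; , part_ab holds three;four; , part_ac holds five;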

The other problem is that a non-standard delimiter is something you have to remember, it is not obvious from looking at the file, and it might not play nicely with certain tools.

One solution is to use awk to do what bcp cannot. While processing the stream, we do the following (spelled out in the expanded awk program after this list):

  • remove all \0 characters
  • escape the escape character \
  • escape the tab character \t
  • escape the newline character \n
  • replace the funny delimiters ^A^B^B^A and ^A^C^C^A with \t and \0 respectively
  • within each split, we replace \0 with \n
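For reference, here is the same awk program written out in one piece; this is just awk_command_elements joined together, with comments added, not new logic:

awk -v rt="$rt" -v ft="$ft" '
BEGIN { RS = rt; FS = ft; OFS = "\t"; ORS = "\0" }  # bcp delimiters in, \t fields and \0 rows out
{
  gsub(/\0/, "")        # remove stray NUL characters
  gsub(/\\/, "\\\\")    # escape the escape character \
  gsub(/\n/, "\\\n")    # escape embedded newlines
  gsub(/\t/, "\\\t")    # escape embedded tabs
  $1 = $1               # rebuild the record so OFS replaces the field terminator
} 1'                    # the trailing 1 prints each record with ORS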

The result is ugly, but sometimes that’s what you have to do.

File splitting

So, after our stream makes it through awk, we’re just about home free. Now we have a stream of data from BCP, with the delimiters properly escaped. All that remains is to implement the file splitting component.

File splitting (or using splittable compression) enables your compute platform to parallelize load operations. It also means you won’t run up against file part limitations in a multi-part upload etc.

In our awk command, we did not replace the row terminator ^A^C^C^A with \n immediately. Even though we escaped newline characters in the data, the split command does not understand escaping, so we could end up with a single row spanning multiple files. To resolve this, the stream coming out of awk uses \0 as the row separator, and only within the split filter do we replace \0 with \n.

split generates a filename part, exposed to the filter as $FILE, which we incorporate into the s3 key. The split filter is also where we gzip.
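To see the splitting piece in isolation, here is a standalone sketch; the sample rows, bucket, and key are made up, and in the real script the input comes from awk:

printf 'a\tb\0c\td\0e\tf\0' | \
split \
-l 2 \
-t '\0' \
--filter='tr "\0" "\n" | gzip | aws s3 cp - s3://my-bucket/demo-$FILE.tsv.gz'
# with the default prefix, this uploads demo-xaa.tsv.gz (two rows) and demo-xab.tsv.gz (one row)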

Other notes

If you are moving a lot of data, you may want to increase the multipart_chunksize parameter in your aws s3 configuration.
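For example, using the AWS CLI’s s3 configuration settings (the 64MB value is just an illustration; tune it to your data and network):

aws configure set default.s3.multipart_chunksize 64MB
# or equivalently, in ~/.aws/config:
# [default]
# s3 =
#   multipart_chunksize = 64MB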
