BCP is great because it is fast and extremely reliable.
But it has some serious limitations:
- it cannot escape delimiters or use quoting to handle fields that contain delimiters (so if your data has tabs, you can’t make a tsv)
- it cannot pipe to stdout (so no compression and no streaming to s3 out of the box)
Thankfully, however, bcp runs on linux. So, despite these limitations, with a little bash trickery, we can use BCP to do the following:
- stream a very large table to s3
- split every 10,000,000 rows
- use gzip compression
- use standard delimiters \t and \n, with escape character \
I’ll spare you the journey of discovery and just give you the goods. Then I’ll explain the choices afterwards.
The solution
#!/bin/bash
a=$'\001' # SOH
b=$'\002' # STX
c=$'\003' # ETX
ft="$a$b$b$a"
rt="$a$c$c$a"
awk_command_elements=(
'BEGIN {RS=rt;FS=ft;OFS="\t";ORS="\0"}'
'{gsub(/\0/,"");'
'gsub(/\\/,"\\\\");'
'gsub(/\n/,"\\\n");'
'gsub(/\t/,"\\\t");'
'$1=$1}1'
)
full_awk_command=$(IFS=; echo "${awk_command_elements[*]}")
tr_replace_nul='tr "\0" "\n"' # replace NUL with newline
s3_bucket='my-bucket'
s3_key='my-table-$FILE.csv.gz' # $FILE will be used in split command
tfifo=$(mktemp -d)/fifo
mkfifo "$tfifo"
bcp "
SELECT *
FROM db01.dbo.table01 s (nolock)
" \
queryout $tfifo \
-S server01 \
-U my_user \
-P "$MY_PASS" \
-a 65535 \
-c \
-r "$rt" \
-t "$ft" &
pid=$! # get PID of backgrounded bcp process
count=$(ps -p $pid -o pid= |wc -l) # check whether process is still running
if [[ $count -eq 0 ]] # if process is already terminated, something went wrong
then
echo "something went wrong with bcp command"
rm $tfifo
wait $pid
exit $?
else
echo "bcp command still running"
cat "$tfifo" | \
awk \
-v rt="$rt" \
-v ft="$ft" \
"$full_awk_command" | \
split \
-l 10000000 \
-t "\0" \
--filter="${tr_replace_nul} | gzip | aws s3 cp - s3://${s3_bucket}/${s3_key}" \
&& rm $tfifo
exit $?
fi
Is this ugly? Absolutely. If support for escaping delimiters and piping to stdout is added to bcp, then such a hack will not be needed. In the meantime, this will do the job.
Discussion below.
Streaming to s3
AWS’s command line utility allows you to copy to s3 from stdin. So e.g. you can do echo 'abc' | aws s3 cp - s3://my-bucket/my-file.txt.
But BCP cannot pipe to stdout. The hack to get around this is to use a fifo file.
- create a fifo file and store the filename in variable $tfifo
- start writing to the fifo file with bcp … queryout $tfifo …, and background the process
- then use aws s3 cp $tfifo s3://... to stream from bcp through the fifo file
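To make the pattern concrete, here is a minimal sketch of the fifo trick in isolation. The printf stands in for bcp (a writer that can only target a file) and cat stands in for aws s3 cp (a reader that consumes a stream); the filenames are made up:

```shell
#!/bin/bash
# Minimal fifo sketch: a file-only writer feeding a stream-only reader.
tfifo=$(mktemp -d)/fifo
mkfifo "$tfifo"

printf 'hello\n' > "$tfifo" &   # writer (stand-in for bcp); blocks until a reader opens the fifo
result=$(cat "$tfifo")          # reader (stand-in for aws s3 cp); consumes the stream
wait                            # reap the backgrounded writer
rm "$tfifo"

printf '%s\n' "$result"
```

The key property is that no intermediate file the size of the data is ever written to disk; the fifo is just a rendezvous point between the two processes.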
Delimiters
BCP cannot escape delimiters. Consequently you need to use delimiters that will never appear in your data. You might have luck with the control characters 0x01 and 0x02. But in my experience, if you have enough data, eventually you’ll encounter these in your data too. So what I have had luck with is the multi-character delimiters defined in this block:
a=$'\001' # a.k.a. 0x01 / SOH / ^A
b=$'\002' # a.k.a. 0x02 / STX / ^B
c=$'\003' # a.k.a. 0x03 / ETX / ^C
ft="$a$b$b$a" # field terminator
rt="$a$c$c$a" # row terminator
There is nothing special about these delimiters apart from the fact they likely won’t be found in the wild.
Now, you could just leave your data in s3 with these multi-character delimiters. But there are a few problems with this.
For one, you cannot use split with these delimiters. The coreutils program split is what lets us split files every N rows. The latest version can use any single character as a record separator; however, split does not support multi-character separators.
The other problem is that using a non-standard delimiter means you have to remember what delimiter you used and it may not be obvious. And it might not play nicely with certain tools.
One solution is to use awk to do what bcp cannot. While processing the stream, we do the following:
- remove all \0 characters
- escape the escape character \
- escape the tab character \t
- escape the newline character \n
- replace the funny delimiters ^A^B^B^A and ^A^C^C^A with \t and \0
- within each split, replace \0 with \n
The result is ugly, but sometimes that’s what you have to do.
File splitting
So, after our stream makes it through awk, we’re just about home free. Now we have a stream of data from BCP, with the delimiters properly escaped. All that remains is to implement the file splitting component.
File splitting (or using splittable compression) enables your compute platform to parallelize load operations. It also means you won’t run up against file part limitations in a multi-part upload etc.
In our awk command, we did not replace ^A^C^C^A with \n immediately. Even though we escaped newline characters in the data, the split command doesn’t understand escaping, so we could end up with a single row spanning multiple files. To resolve this, coming out of awk we use \0 as the row separator, and only within the split filter do we replace \0 with \n.
The split program generates a filename suffix, exposed to the --filter command as $FILE, which is incorporated into the s3 key. The split filter is also where we gzip.
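The NUL-record splitting behaves like this in miniature. This assumes GNU split; the directory and record names are illustrative, and the xaa/xab suffixes are split’s defaults:

```shell
#!/bin/bash
# Three NUL-terminated records, split every 2 records; each chunk is
# converted back to newline-terminated rows and gzipped by the filter.
outdir=$(mktemp -d)
printf 'r1\0r2\0r3\0' |
  split -l 2 -t '\0' --filter="tr '\0' '\n' | gzip > $outdir/part-\$FILE.gz"

gunzip -c "$outdir"/part-xaa.gz   # first two records
gunzip -c "$outdir"/part-xab.gz   # remaining record
```

Note that $FILE is escaped inside the --filter string so that split, not the invoking shell, expands it for each chunk.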
Other notes
If you are moving a lot of data, you may want to increase the multipart_chunksize parameter in your aws s3 configuration.
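For example, in ~/.aws/config (the 64MB value is illustrative, not a recommendation):

```ini
[default]
s3 =
  multipart_chunksize = 64MB
```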