Streams in node are awesome. I’ve heard one of the core node contributors say, “if you’re not using streams, you’re doing it wrong”. Why are streams awesome? I’m glad you asked.
- improve performance for large data sets—stream from source to response without buffering it all in memory (see the sketch after this list)
- improve perceived responsiveness—start responding as soon as any data is available, rather than waiting for it all
- clean service composition pattern—make a request to another service and stream its response back to the user, filtering or modifying it in some way in the process
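To make the first point concrete, here's a minimal sketch (the file name and port are made up): the server pipes a large file straight into the HTTP response instead of reading the whole thing into memory first.

var http = require('http'),
    fs = require('fs');

http.createServer(function (req, res) {
  res.writeHead(200, { 'Content-Type': 'text/csv' });
  // bytes flow to the client as they're read; memory use stays flat
  fs.createReadStream('big-data.csv').pipe(res);
}).listen(8080);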
However, streaming in node isn't straightforward. That's even more true since the introduction of Streams2, aka "new streams", because module writers now try to support both the old and new interfaces.
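If you haven't seen the difference, here's a rough sketch ('example.txt' is a made-up file name). Old streams push 'data' events at you as soon as data exists; Streams2 buffers internally and waits for you to pull with read().

var fs = require('fs');

// old-style (flowing) consumption: data is pushed at you
fs.createReadStream('example.txt').on('data', function (chunk) {
  console.log('pushed %d bytes', chunk.length);
});

// new-style (paused) consumption: you pull data when you're ready
var readable = fs.createReadStream('example.txt');
readable.on('readable', function () {
  var chunk;
  while ((chunk = readable.read()) !== null) {
    console.log('pulled %d bytes', chunk.length);
  }
});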
I ran into this challenge the other day. After hours of debugging, I realized that mixing stream modes was the cause, and I somehow stumbled onto a stupid simple solution.
The goal was to stream JavaScript arrays into node-csv and then use node-ftp to upload the resulting CSV to a file. For some reason, though, the first line or two were always missing from the created file.
Here’s a small bit of test code illustrating the problem.
var stream = require('stream'),
    FTP = require('ftp'),
    CSV = require('csv'),
    util = require('util');

var ftp = new FTP(),
    data = new stream.PassThrough({ objectMode: true });

// write something before ftp connection
data.write(['a','b','c']);
data.write(['d','e','f']);

ftp.on('ready', function () {
  ftp.put(data.pipe(CSV()), 'my-data.csv', function (err) {
    ftp.end();
    if (err) {
      util.log('Got an FTP upload error ' + err.message);
    }
  });
});

ftp.on('error', function (err) {
  util.log('Got an FTP connection error ' + err.message);
});

ftp.connect({ host: 'localhost', user: 'codyaray', password: 'youwish' });

// write the rest (and end the stream) after initiating the connection
data.write(['g','h','i']);
data.write(['j','k','l']);
data.write(['m','n','o']);
data.end();
Running this script produces the following output:
$ cat my-data.csv
d,e,f
g,h,i
j,k,l
m,n,o
Notice it's missing the first row? If you tell the CSV module to add a header, both the header and the first data row go missing. And it's not simply that writes before the FTP connection are dropped: "d,e,f" was also written before the connection, yet it made it into the output file.
It turns out that the CSV module is emitting the data before the FTP module is ready for it. After several hours of banging my head on the table, I surmised that the CSV module and the FTP module each try to handle both old and new streams, and do so in incompatible ways.
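You can reproduce this class of data loss with nothing but core streams. Here's a contrived sketch: once a stream is flowing, any chunk emitted before a 'data' listener attaches is simply discarded.

var stream = require('stream');
var s = new stream.PassThrough();

s.resume();          // flowing mode, with no 'data' listener attached yet
s.write('lost\n');   // flushed on the next tick with nobody listening

setImmediate(function () {
  // by now the first chunk has already been emitted and discarded
  s.on('data', function (chunk) {
    process.stdout.write(chunk); // prints only "kept"
  });
  s.write('kept\n');
});

A couple more hours of banging, and I stumbled onto this solution.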
Add a PassThrough stream between the incompatible modules (node-csv and node-ftp).
So we change this
ftp.put(data.pipe(CSV()), 'my-data.csv', function (err) { ... });
to this
ftp.put(data.pipe(CSV()).pipe(stream.PassThrough()), 'my-data.csv', function (err) { ... });
Really, that’s it. My theory is that this works because the built-in PassThrough stream is compatible with both the old and new streaming modes, and by playing middleman it can help “translate” between the CSV and FTP modules.
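For what it's worth, that theory holds up in the core-only sketch above: pipe the stream into a PassThrough right away, and nothing is dropped, because the PassThrough attaches its listener immediately and buffers chunks until a consumer shows up.

var stream = require('stream');
var s = new stream.PassThrough();

// attach the middleman immediately; it buffers whatever s emits
var buffered = s.pipe(new stream.PassThrough());

s.write('not lost\n');

setImmediate(function () {
  buffered.on('data', function (chunk) {
    process.stdout.write(chunk); // prints "not lost" then "also kept"
  });
  s.write('also kept\n');
});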
Do you have a better explanation for why this works?