These days there are plenty of options, so I went out and bought a Brother MFC-7860DW. I hadn't researched this heavily, so don't consider this a review. It's a scanner/printer that can handle Wi-Fi (handy where I wanted to put it), print duplex (yay), and scan via ADF (automatic document feeder, aka sheetfed) straight to a pre-configured FTP location. I overlooked one feature, duplex ADF scanning, which I wish I had.
Now that I had a scanner that uploads to a local FTP drive, I wanted more. I wanted each scan to be uploaded automatically to a Google Drive folder. Google Drive has very good automatic OCR of all text in uploaded scans, is accessible anywhere, and is much better than an FTP directory in terms of usability. I can even load documents in a cell phone app.
I started a small project that would monitor for new files on the FTP drive and then upload them to Google Drive. I thought this would take an afternoon, but 600 lines of code later, I'm only now feeling done with the project. I did throw in learning Go as part of this, which didn't help things.
The project is here if it inspires anyone: https://github.com/Gregable/ScanServer
One of the big issues I didn't think about initially was that scanned files wouldn't be instantly "complete". It would take several seconds between a file being created and the scanner finishing its upload. If I started processing the file too quickly, I'd get a partial upload. If I slept too long, I'd add too much latency to the whole process. I wanted to be able to look at the uploaded file seconds after scanning it, so I could verify that it looked right before disposing of (shredding) the scanned document, so extra latency wasn't acceptable.
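One common way to balance those two failure modes is to poll the file size and treat the upload as finished once the size has stopped changing for a few checks in a row. The sketch below shows that idea; it is not lifted from the project, and the names `sizeProbe`, `statProbe`, and `waitUntilStable`, along with the interval and check counts, are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// sizeProbe reports the current size of an upload in bytes (hypothetical helper type).
type sizeProbe func() (int64, error)

// statProbe builds a probe backed by os.Stat for a file on disk.
func statProbe(path string) sizeProbe {
	return func() (int64, error) {
		info, err := os.Stat(path)
		if err != nil {
			return 0, err
		}
		return info.Size(), nil
	}
}

// waitUntilStable polls until the size is non-zero and has been unchanged for
// stableChecks consecutive polls, or gives up once timeout has elapsed.
func waitUntilStable(probe sizeProbe, interval time.Duration, stableChecks int, timeout time.Duration) (int64, error) {
	deadline := time.Now().Add(timeout)
	last := int64(-1)
	unchanged := 0
	for time.Now().Before(deadline) {
		size, err := probe()
		if err != nil {
			return 0, err
		}
		if size > 0 && size == last {
			unchanged++
			if unchanged >= stableChecks {
				return size, nil
			}
		} else {
			// The file grew (or is still empty): start counting again.
			unchanged = 0
			last = size
		}
		time.Sleep(interval)
	}
	return 0, fmt.Errorf("upload did not settle within %v", timeout)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: waitstable <file>")
		return
	}
	size, err := waitUntilStable(statProbe(os.Args[1]), 500*time.Millisecond, 4, time.Minute)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Printf("%s looks complete at %d bytes\n", os.Args[1], size)
}
```

The interval and the number of stable checks are the knobs here: shorter values cut latency but raise the risk of picking up a partial file if the scanner pauses mid-upload.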
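Another issue was scanning duplex. I thought it should be possible to scan the stack twice, once per side, as two single-sided documents and have my script merge them. As it turns out, this is possible, though tricky. If you flip a multiple-page document over, the page order is reversed, so the back sides arrive last page first. As a result, you need to interleave the back pages in reverse order. For a three-page document, with pages numbered in the order they were scanned, something like: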
- Front doc, Page 1
- Back doc, Page 3
- Front doc, Page 2
- Back doc, Page 2
- Front doc, Page 3
- Back doc, Page 1
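The ordering in the list above can be written as a small function. This is a minimal sketch that uses page labels as strings rather than real PDF pages; `interleave` is a name chosen for illustration, not necessarily what the project uses.

```go
package main

import "fmt"

// interleave merges two scans of the same stack. front and back each hold one
// entry per page in the order the scanner produced them, so back is walked
// from its last page to its first.
func interleave(front, back []string) ([]string, error) {
	if len(front) != len(back) {
		return nil, fmt.Errorf("page count mismatch: %d front, %d back", len(front), len(back))
	}
	merged := make([]string, 0, 2*len(front))
	for i := range front {
		merged = append(merged, front[i], back[len(back)-1-i])
	}
	return merged, nil
}

func main() {
	front := []string{"front-1", "front-2", "front-3"}
	back := []string{"back-1", "back-2", "back-3"}
	merged, err := interleave(front, back)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Prints: [front-1 back-3 front-2 back-2 front-3 back-1]
	fmt.Println(merged)
}
```

Refusing to merge when the page counts differ keeps a missed or double-fed sheet from silently producing a scrambled document.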
My scanner's buttons let me select different filename prefixes for uploaded docs, things like "laser", "receipt", and "estimate", so I ended up choosing one of the prefixes to indicate duplex docs. I then had to think about error cases. What if a user accidentally scans the front half as duplex and marks the back half as single-sided, or vice versa? In these cases, I wanted to upload the document as two single-sided documents. What if I saw two duplex documents come in, but they had different numbers of pages? In this case, I wanted to treat the first one as a single-sided document, but keep the second as potentially the first side of a duplex document that might come next. What if I saw a single duplex document come in and nothing followed? In this case, I wanted to wait 15 minutes to see if I got the other side of the duplex document, and if not, give up and treat it as a single-sided document. These cases were fun to work through.
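Those rules fit in a small state machine that holds at most one unpaired duplex scan. The sketch below is one way to express them, assuming hypothetical `Doc` and `Pairer` types and page labels as strings. It is an illustration of the cases described, not the code from the repository.

```go
package main

import (
	"fmt"
	"time"
)

// Doc is a hypothetical stand-in for one finished scan.
type Doc struct {
	Name    string
	Duplex  bool     // true when the scan used the duplex filename prefix
	Pages   []string // one entry per page, in scan order
	Arrived time.Time
}

// Pairer holds at most one duplex scan that is waiting for its other side.
type Pairer struct {
	Wait    time.Duration
	pending *Doc
}

// flush releases the pending scan, if any, as a single-sided document.
func (p *Pairer) flush() []Doc {
	if p.pending == nil {
		return nil
	}
	d := *p.pending
	d.Duplex = false
	p.pending = nil
	return []Doc{d}
}

// Add takes one finished scan and returns the documents now ready to upload.
func (p *Pairer) Add(d Doc) []Doc {
	if !d.Duplex {
		// A single-sided scan ends any wait: upload both as single-sided.
		return append(p.flush(), d)
	}
	if p.pending != nil && len(p.pending.Pages) == len(d.Pages) {
		front := *p.pending
		p.pending = nil
		return []Doc{merge(front, d)}
	}
	// No partner, or a page-count mismatch: release the old scan, hold the new one.
	out := p.flush()
	p.pending = &d
	return out
}

// Expire gives up on a pending scan that has waited at least Wait.
func (p *Pairer) Expire(now time.Time) []Doc {
	if p.pending != nil && now.Sub(p.pending.Arrived) >= p.Wait {
		return p.flush()
	}
	return nil
}

// merge interleaves front pages with back pages taken in reverse scan order.
func merge(front, back Doc) Doc {
	pages := make([]string, 0, 2*len(front.Pages))
	n := len(back.Pages)
	for i, pg := range front.Pages {
		pages = append(pages, pg, back.Pages[n-1-i])
	}
	return Doc{Name: front.Name, Pages: pages, Arrived: back.Arrived}
}

func main() {
	p := &Pairer{Wait: 15 * time.Minute}
	now := time.Now()
	front := Doc{Name: "duplex_001.pdf", Duplex: true, Pages: []string{"F1", "F2"}, Arrived: now}
	back := Doc{Name: "duplex_002.pdf", Duplex: true, Pages: []string{"B2", "B1"}, Arrived: now.Add(time.Minute)}
	p.Add(front) // held, nothing to upload yet
	for _, d := range p.Add(back) {
		fmt.Println(d.Name, d.Pages)
	}
}
```

In a long-running program, `Expire` would be called from a timer or ticker so that a lone front side is eventually released even if no further scans arrive.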
Ultimately, I had fun implementing this with Go channels and goroutines. I had one goroutine in a loop monitoring for new FTP documents and waiting until they seemed to be "finished" uploading before outputting each document to a channel. I had another goroutine sorting out the duplex / single-sided issues, creating merged documents from duplex uploads, and handling the user error cases above. This goroutine spit out the documents to be uploaded to Drive into another channel. A third goroutine uploaded to Drive and spit out to a "final" channel, whose reader cleaned up any temporary files created and echoed success messages to the console.
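The overall shape is a classic channel pipeline. Here is a stripped-down sketch of that layout in which every stage is a stub: `watch`, `pair`, `upload`, and `run` are illustrative names, and the real stages would do FTP polling, duplex merging, and Drive uploads instead of passing strings along.

```go
package main

import "fmt"

// watch is a stub for the FTP monitor: it emits the name of each finished scan.
func watch(names []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, n := range names {
			out <- n
		}
	}()
	return out
}

// pair is a stub for the duplex stage; here it passes documents through unchanged.
func pair(in <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for doc := range in {
			out <- doc
		}
	}()
	return out
}

// upload is a stub for the Drive stage; it only labels each document as uploaded.
func upload(in <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for doc := range in {
			out <- "uploaded " + doc
		}
	}()
	return out
}

// run wires the stages together and drains the final channel, which is where
// temporary-file cleanup and console messages would go.
func run(names []string) []string {
	var done []string
	for msg := range upload(pair(watch(names))) {
		done = append(done, msg)
	}
	return done
}

func main() {
	for _, msg := range run([]string{"laser_001.pdf", "receipt_002.pdf"}) {
		fmt.Println(msg)
	}
}
```

Each stage closes its output channel when its input is exhausted, so shutdown propagates down the pipeline without any extra signalling.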
I also moved this little project over to a Raspberry Pi. As a result, I don't even need a computer running to keep this setup going, and considering the ~3W power draw of a Raspberry Pi, it's a no-brainer to leave it running continuously.