Problem when encrypting files with special characters in file name

Hello!

I am new to Cryptomator. I want to encrypt my files locally and upload the encrypted files to a cloud service. As a first step, I created a large vault and copied roughly 50 GB of data (about 150,000 files) into it.

When I compare the original files with the files in the vault, 136 files are flagged as not copied to the vault. To be precise: the 136 files appear in both the original directories and the vault directories. The file names look the same, but my synchronization software (I use Total Commander) reports that the file name or the directory name differs.

What all problematic files and directories have in common: They have special characters in their names (a German “Umlaut” like ä, ö, ü, or a Slavic s with caron like š). It looks as if Cryptomator changes the encoding of the file name, so the synchronization software regards them as different file names, even if their checksum is the same.

There are thousands of files in the vault with special characters in their file name, so this happens only with very few files. It is still very annoying, because it prevents synchronization. I could delete the files in the original directories and copy the files from the Cryptomator vault. But I feel uneasy about doing this, and I would like to know why this happens.

I have started Cryptomator in debug mode, and tried copying three of the problematic files. In just 5 minutes, Cryptomator has created a log file with 37.000 lines and 4 MB size. I cannot remove all file names (sensitive information), it would be too much work. But as an example, I found the file name in the log file as “Schl𴥲ter” instead of “Schlüter”.

What can I search for in the log file?
Has someone else found this problem?

I use Windows 10 Pro in English. Regarding Cryptomator, I use version 1.4.2 with Dokany. I already had this problem with version 1.4.0, and I updated to 1.4.2 to see if the problem was solved, but unfortunately it was not.

Thank you very much in advance for your help and comments!

Thanks for reporting this. We’ve already seen this issue before on macOS and solved it there. It turns out there are different ways in UTF-8 to encode umlauts as I’ve explained in this issue:

Apparently this is also relevant on Windows with Dokany. @infeo Let’s try to replicate this and create an issue in the dokany-nio-adapter similar to fuse-nio-adapter#27.

Thank you for your reply and for looking at the issue. I had searched the knowledge base, but not for “unicode”.

In your reply you mention a different OS. To make it clear: everything happens on one computer running Windows 10 Pro. I am copying files and directories from an NTFS partition A to a Cryptomator vault located on a different NTFS partition B. So I am not really sure what you mean by “different OS”: is one OS Windows 10 as installed on the computer, and the second OS the pair Cryptomator/Dokany?

The irritating thing is that this happens only with the names of very few files. So it may be difficult to replicate.

You mention that there are different ways in UTF-8 to encode umlauts.
Is there a way I can check the problematic files?
Is there a way to make sure Windows encodes umlauts in a consistent way?

Please let me know if I can help with providing more information, testing, etc.

Many thanks again!

You can copy and paste the following two filenames and see if it makes any difference:

NFD-encoded: ü.txt
NFC-encoded: ü.txt

(let’s hope that neither the browser nor the clipboard normalizes both forms to an equal byte representation)

Windows should prefer the latter. It is common practice to use NFC whenever some text data is visible to the user.
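
To make the difference concrete, here is a minimal Python sketch (an illustration using the standard `unicodedata` module, not anything Cryptomator itself runs). It builds both forms of the file name from explicit code points, so nothing depends on the browser or the clipboard:

```python
import unicodedata

# NFC: one precomposed code point, U+00FC LATIN SMALL LETTER U WITH DIAERESIS
nfc = "\u00fc.txt"
# NFD: plain "u" followed by U+0308 COMBINING DIAERESIS
nfd = "u\u0308.txt"

# Both render as "ü.txt", but they are different strings
print(nfc == nfd)                    # False

# The UTF-8 byte sequences differ as well
print(nfc.encode("utf-8").hex(" "))  # c3 bc 2e 74 78 74
print(nfd.encode("utf-8").hex(" "))  # 75 cc 88 2e 74 78 74

# Normalizing the NFD name to NFC makes the two names equal
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```

A tool that compares names code point by code point will treat these as two different files, even though they look identical on screen.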

overheadhunter, thank you so much! You nailed the problem. Neither the browser nor the clipboard was a problem.

I tried the NFC-encoded file name (in short, NFC file) and the NFD-encoded file name (in short, NFD file).
When I copied the NFC file to the Cryptomator vault, everything was fine.
When I copied the NFD file to the Cryptomator vault, it looks like the file in the vault has the same name as the original file, but the file synchronizing software says the file names are different.
So the problem lies with the NFD encoded file when copied to the Cryptomator vault.

Is this a problem or a bug with Windows or with Cryptomator?
Why do I have a few files with NFD encoding when almost all are NFC-encoded? I only work with Windows, no macOS, no Linux.
How can I find out which file names use NFD-encoding and, once found, can I easily change the encoding to NFC-encoding? Or would a file synchronizing software then say that the files have different names?

Many thanks again!

Yes. :stuck_out_tongue_winking_eye:

It is unclear who should be responsible for handling such differently encoded files. Since we should not rely on anybody else to fix it, we will need to add a normalization layer to our Dokany implementation to change all file names to NFC. This should avoid such problems.

As for why a few of your files are NFD-encoded: maybe you downloaded them from somebody working on a Mac?

Finding out which file names use NFD is difficult without looking into the byte representation of the text using a hex editor.
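
If a scripting language is at hand, a small script can take the place of the hex editor. The following is a rough Python sketch under the assumption that the names on disk are read back unchanged (which is the case on NTFS); the function names `code_points`, `is_nfc` and `find_non_nfc_names` are made up for this example:

```python
import os
import unicodedata

def code_points(name):
    """Return the code points of a name, e.g. 'U+0075 U+0308'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in name)

def is_nfc(name):
    """True if the name is already in NFC form."""
    return unicodedata.normalize("NFC", name) == name

def find_non_nfc_names(root):
    """Yield (directory, name) for every file or directory name
    below root that is not in NFC form."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_nfc(name):
                yield dirpath, name
```

Looping over `find_non_nfc_names(...)` with the folder to check and printing `code_points(name)` for each hit shows exactly where a combining character such as U+0308 hides in a name.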

Once identified, it is fairly easy to change the encoding. Just rename the file and type the umlaut using your keyboard. Windows will then use its default encoding.
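
For more than a handful of files, the renaming could also be scripted. This is only a sketch, not an official tool: `rename_to_nfc` is a hypothetical helper, it defaults to a dry run that changes nothing, and it assumes a file system that treats the NFD and NFC spellings as distinct names (as NTFS does). Try it on a copy first.

```python
import os
import unicodedata

def rename_to_nfc(root, dry_run=True):
    """Rename entries below root whose names are not in NFC form.

    Returns a list of (old_path, new_path) pairs. With dry_run=True
    (the default) nothing is renamed; the list only shows the plan.
    """
    changes = []
    # Walk bottom-up so children are handled before their parent directory
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            target = unicodedata.normalize("NFC", name)
            if target == name:
                continue  # already NFC
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, target)
            if os.path.lexists(new_path):
                continue  # never overwrite an existing entry
            changes.append((old_path, new_path))
            if not dry_run:
                os.rename(old_path, new_path)
    return changes
```

Calling it once with the default `dry_run=True` and reading the returned list before running it again with `dry_run=False` keeps surprises to a minimum. Entries whose NFC name already exists are skipped rather than overwritten, so those have to be resolved by hand.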

Hello overheadhunter,

thank you for your reply and apologies for my late reply!

I understand. I hope you can implement such a normalization layer and it is not a lot of work. In any case, you have identified the problem very quickly, and since it only affects a few files, I was able to solve the problem manually.

Indeed I got some of the files and directories from a colleague that uses a Mac. But this is not true for all of the files. It doesn’t really matter :grinning:

I did not look into the byte representation of the text using a hex editor. But I calculated an MD5 checksum for the two text files with different encoding in their file names. And it turned out that the checksum was identical. Still my file synchronising software classified them (now I know correctly) as different files …

Thank you for this suggestion. This is exactly what I did, and this solved the problem.

Again, having the normalization layer would be great, but at least the issue and the problem have been identified.

Many thanks again!

I am encountering some other issues with Cryptomator, which I will raise in some other threads, some of them already existing.

The checksum is based only on the file contents. You can rename a file and the checksum stays the same.
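
A short Python sketch illustrates this (`md5_of_file` is a helper written for this example, and the two files are created in a temporary folder just for the demonstration):

```python
import hashlib
import os
import tempfile

def md5_of_file(path):
    """Return the MD5 hex digest of a file's contents."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    nfc_path = os.path.join(tmp, "\u00fc.txt")   # NFC-encoded name
    nfd_path = os.path.join(tmp, "u\u0308.txt")  # NFD-encoded name
    for path in (nfc_path, nfd_path):
        with open(path, "wb") as f:
            f.write(b"same content")
    same_digest = md5_of_file(nfc_path) == md5_of_file(nfd_path)
    same_name = os.path.basename(nfc_path) == os.path.basename(nfd_path)

print(same_digest)  # True: the contents are identical
print(same_name)    # False: the names differ in their code points
```

So an identical checksum says nothing about the file name, which is why the synchronization software can still report a difference.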