Why would a UTF-8 MySQL backup contain invalid UTF-8 characters?
bjoern_tantau@swg-empire.de 2 weeks ago
Encoding is hard, especially when your data comes from web forms or CSV files. MySQL needed three tries to get UTF-8 right (what it calls utf8 is really the three-byte utf8mb3; full UTF-8 only arrived with utf8mb4), and you need DB admins, and often programmers as well, who actually know this. So not everything MySQL calls UTF-8 actually is.
And often enough it took a long while for something to actually reach UTF-8 status, and idiots not converting the existing data along the way leave you with databases containing a mixture of encodings.
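A minimal Python sketch (plain Python, nothing MySQL actually ships, made-up example rows) of what that mixture ends up looking like on disk:

```python
# Rows stored before a proper conversion are still Latin-1 bytes, newer
# rows are real UTF-8, and a naive dump just concatenates both.
old_row = "Björn".encode("latin-1")   # b'Bj\xf6rn'      (one byte for ö)
new_row = "Björn".encode("utf-8")     # b'Bj\xc3\xb6rn'  (two bytes for ö)

dump = old_row + b"\n" + new_row + b"\n"

try:
    dump.decode("utf-8")
except UnicodeDecodeError as err:
    # The lone 0xf6 from the Latin-1 row is not valid UTF-8.
    print("dump is not valid UTF-8:", err)
```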
undefined@lemmy.hogru.ch 2 weeks ago
I guess what gets me is that it’s writing to a UTF-8 file, so you’d expect that file to contain only UTF-8. Hell, I’d take UTF-8 with Base64-encoded binary data over the hodgepodge .sql file coming out of the thing.
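For what it’s worth, a rough Python sketch of that idea (table and column values made up): Base64-encoding binary values keeps the dump pure ASCII, and therefore valid UTF-8, at the cost of roughly a third more space.

```python
import base64

# Hypothetical binary column value (start of a PNG header plus some junk bytes).
blob = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])

encoded = base64.b64encode(blob).decode("ascii")     # 'iVBORwD/'
line = f"INSERT INTO files VALUES ('{encoded}');\n"  # made-up table name
line.encode("utf-8")                                 # always succeeds: pure ASCII
```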
bjoern_tantau@swg-empire.de 2 weeks ago
There is no such thing* as a UTF-8 file. It’s just text encoded in some way, and it’s only a UTF-8 file if everything in it is encoded as UTF-8, which this evidently isn’t.
You can even tell MySQL to export perfectly valid UTF-8 text as ISO 8859-1 and import it into a UTF-8 table without any trouble (apart maybe from characters that can’t be encoded in ISO 8859-1).
*Yes, technically there could be a BOM at the beginning, but almost no tool uses one and most get confused by it. And it still wouldn’t force any data written to the file to actually be UTF-8.
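To make that concrete, a small Python sketch (just the codecs, no MySQL involved): text that fits in ISO 8859-1 round-trips into UTF-8 fine, and nothing about a "UTF-8 file" validates its bytes, because every possible byte is legal ISO 8859-1 while UTF-8 decoding actually rejects things.

```python
# Round trip: export "as Latin-1", re-encode as UTF-8 on import.
text = "Björn"
latin1_dump = text.encode("iso-8859-1")
assert latin1_dump.decode("iso-8859-1").encode("utf-8") == text.encode("utf-8")

# The caveat: characters with no Latin-1 byte get mangled on the way out.
"€".encode("iso-8859-1", errors="replace")   # b'?'

# And nothing enforces "UTF-8-ness" of a file's bytes:
arbitrary = bytes(range(256))
arbitrary.decode("iso-8859-1")               # every byte decodes, never fails
try:
    arbitrary.decode("utf-8")                # fails at the first stray 0x80
except UnicodeDecodeError as err:
    print("UTF-8 rejects it:", err)
```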
folekaule@lemmy.world 2 weeks ago
The Unicode standard allows, but recommends against, adding a BOM to UTF-8 files. UTF-8 doesn’t need one.
I’ve only ever seen Microsoft tools add it, and it breaks some parsers.
Please don’t add a BOM to UTF-8 files unless for some reason you actually need one.
undefined@lemmy.hogru.ch 2 weeks ago
Right, but if you’re telling the software to encode a file as UTF-8, maybe the software should actually encode it as UTF-8.