Unicode Character problems in JSON and playing with BOM (Byte Order Mark)

Ferhat Karataş
May 8, 2021

We are currently working on a JSON API whose endpoint is written in C# and whose requests are made from PHP. I want to share what I learned during this integration. Here we go!

Let’s start our article with a question.

Is it enough to set “application/json; charset=utf-8” as the Content-Type of the page to solve Unicode problems?

Let’s spare the developer who came here for a quick answer and give the conclusion up front.

The answer is NO!

Because the Content-Type header only declares an encoding; it does not change the encoding the page is actually written in.

So what is needed to prevent Unicode characters from appearing as question marks?

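The answer is to set the content encoding of the response, not just the Content-Type! Here, content encoding means the character encoding the response body is actually written in, not the HTTP Content-Encoding header used for compression.

Of course, the syntax for this differs from language to language. It is enough to adapt the setting to the language your application is written in.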
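
To see why the header alone cannot help, here is a minimal Python sketch. The endpoint in this article is written in C#, so this only illustrates the idea, and the payload and variable names are made up. The Content-Type value is identical in both cases, yet only the body that is really written as UTF-8 keeps the character.

```python
import json

payload = {"name": "Karataş"}  # made-up payload with a non-ASCII character

# ensure_ascii=False keeps "ş" as a real character instead of a \u escape
text = json.dumps(payload, ensure_ascii=False)

# The declared encoding is the same for both bodies below
content_type = "application/json; charset=utf-8"

# Body written with an encoding that cannot represent "ş": it becomes "?"
lossy_body = text.encode("ascii", errors="replace")

# Body actually written as UTF-8: the character survives
utf8_body = text.encode("utf-8")

print(lossy_body)
print(utf8_body.decode("utf-8"))
```

Once the character has been replaced by a question mark in the bytes, nothing the consumer does with the header can bring it back.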

Let’s see how it is used by putting the full version of the code here:

Note: If you pass Formatting.Indented, as in (nt, Formatting.Indented), \r\n sequences will appear after each comma on the side that decodes this endpoint. \r is the carriage return character (ASCII 13) and \n is the line feed (newline) character (ASCII 10). Therefore, it is not wise for developers who design endpoints to format and pretty-print their JSON output. Remember that every extra touch you make can create decoding problems on the other side.
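
The effect of indented output can be sketched in Python as well. This is an illustration only: the record is hypothetical, and Python's json module emits "\n" for indented output rather than the "\r\n" described above.

```python
import json

record = {"id": 1, "city": "İstanbul"}  # hypothetical record

# Compact form: no whitespace between tokens
compact = json.dumps(record, ensure_ascii=False, separators=(",", ":"))

# Indented form: line breaks and spaces are added for human readers
pretty = json.dumps(record, ensure_ascii=False, indent=2)

print(repr(compact))
print(repr(pretty))

# Both forms decode to the same data; only the raw text differs
same_data = json.loads(compact) == json.loads(pretty)
print(same_data)
```

A JSON parser treats both forms the same way, so the extra characters only matter to a consumer that handles the response as raw text.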

What if the company that owns the JSON endpoint does not fix the Unicode character problem, either deliberately or because of how the company is organized? Can the consumer that decodes the response solve the problem on their side?

Yes, you can specify the encoding while reading the response:

This is the final version:
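
As a rough sketch of reading with an explicit encoding, here is the idea in Python rather than the PHP used in this integration. The payload is made up, and the example assumes the endpoint really sends UTF-8 bytes.

```python
import json

# Bytes as they might arrive from the endpoint (made-up UTF-8 payload)
raw = '{"name": "Karataş"}'.encode("utf-8")

# Decoding with the wrong encoding garbles the non-ASCII character
wrong = raw.decode("latin-1")

# Naming the correct encoding explicitly gives back the original text
right = raw.decode("utf-8")
data = json.loads(right)

print(repr(wrong))
print(data["name"])
```

The point is that the consumer decides how the received bytes are interpreted, so stating the encoding explicitly removes the guesswork.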

Now that the main point of this article is covered, let’s go into some details.

How can we debug the error during processing?

The main purpose here is to find the cause of the error and tell the endpoint's developers what needs to be corrected.

Let’s debug and inspect the value:

When you debug the PHP code, the $json data looks fine: no leading or trailing spaces and no unreadable characters in its content. Or so you think :)

Copy the value and paste it into the text box at this URL: https://apps.timwhitlock.info/unicode/inspect

What!? :)

ZERO WIDTH NO-BREAK SPACE

Have you ever heard of it before?

It is also known as the BYTE ORDER MARK (BOM), code point U+FEFF.

Legacy name (Unicode 1.0) BYTE ORDER MARK

Official name (Unicode 9.0) ZERO WIDTH NO-BREAK SPACE
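
The same inspection can be done locally. The following Python sketch is illustrative only, with a made-up string standing in for the $json value; it looks up the name of the first character and shows that ordinary whitespace trimming does not remove it.

```python
import unicodedata

# Made-up stand-in for the received value: an invisible U+FEFF, then JSON
text = "\ufeff" + '{"ok": true}'

first = text[0]
code_point = f"U+{ord(first):04X}"
name = unicodedata.name(first)
print(code_point, name)

# U+FEFF is not whitespace, so trimming leaves the string unchanged
trimmed = text.strip()
print(trimmed == text)
```

This is why the value looks clean in a debugger: the character has no width and is not treated as a space.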

Click the radio button (Dec) in the second row you see in the picture.

You will see that, in UTF-8, it consists of the bytes 239 187 191 (hex EF BB BF). The same values can be found with a loop over the string.

Let’s see it right away:

Result:

? = 239

? = 187

? = 191
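
A loop like this can be sketched in Python as follows. It is an illustration with a made-up payload, not the PHP loop that produced the result above.

```python
import codecs

# Made-up response body that starts with the UTF-8 byte order mark
raw = codecs.BOM_UTF8 + b'{"ok": true}'

# Collect the first three bytes as decimal numbers
bom_bytes = list(raw[:3])

for value in bom_bytes:
    print(value)  # prints 239, 187, 191 on separate lines
```

Decoded as UTF-8, those three bytes form the single character U+FEFF.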

Let’s replace this character:

Go to the URL below:

https://unicode-table.com/en/FEFF/

Copy the character we want to replace by pressing the COPY button on that page, then paste it between the quotes on the str_replace line.

How the str_replace line looks in the PhpStorm IDE:

Pay attention to ZWNBSP (Zero Width No-Break Space)

How it looks in VS Code:
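
For comparison, here is a minimal Python sketch of the same cleanup, with a made-up payload and a hypothetical helper named strip_bom. Unlike a str_replace over the whole string, it removes the character only at the start. Writing the character as the escape "\ufeff" avoids pasting an invisible character into the source file.

```python
import json

BOM = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, written as an escape


def strip_bom(text: str) -> str:
    """Return text without a leading U+FEFF, if there is one."""
    return text[len(BOM):] if text.startswith(BOM) else text


# Made-up response: UTF-8 BOM (EF BB BF) followed by JSON
raw = b'\xef\xbb\xbf{"name": "Karata\xc5\x9f"}'

text = raw.decode("utf-8")  # the BOM survives as U+FEFF
clean = strip_bom(text)
data = json.loads(clean)
print(data["name"])

# Python's "utf-8-sig" codec drops a leading BOM while decoding
alt = raw.decode("utf-8-sig")
print(alt == clean)
```

In Python, passing the undecoded text with the BOM still attached to json.loads raises an error, which is one way the problem shows up on the consumer side.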

And that brings us to the end of a long, tiring, but enjoyable analysis.

See you in another article ;)
