All The Whitespace Characters? Is It Language Independent?


Answer :

Whether a particular character is categorized as a whitespace character or not should depend on the character set being used. That said, it is not impossible that a programming language can make its own definition of what constitutes whitespace.

Most modern languages use the Unicode Character set, which does have a definition for space separator characters. Any character in the Zs category is a space separator.

You can see the complete list here. In addition you can grep for ;Zs; in the official Unicode Character Database to see those characters. Note that the number of characters in this category may grow as new Unicode versions come into existence, so I will not say how many such characters exist, nor even attempt to list them.

In addition to the Zs Unicode category, Unicode also defines character properties. Among the properties defined by Unicode is a Whitespace property. As of Unicode 7.0, characters with this property include all of the characters with category Zs plus a few control characters (including U+0009, U+000A, U+000B, U+000C, U+000D, and U+0085). You can find all of the characters with the whitespace property at Unicode.org here.

Now many languages, even modern ones, have special symbols for regular expressions such as \s or [:space:] but beware, these only refer to certain characters from the ASCII set; generally these are restricted to

  • SPACE (codepoint 32, U+0020)
  • TAB (codepoint 9, U+0009)
  • LINE FEED (codepoint 10, U+000A)
  • LINE TABULATION (codepoint 11, U+000B)
  • FORM FEED (codepoint 12, U+000C)
  • CARRIAGE RETURN (codepoint 13, U+000D)

Now this list is interesting because it contains not only space separators (Zs), but also from the "Control, Other" category (Cc). This is what a programming language generally means when it uses the term "whitespace."

So probably the best way to answer your question for a "complete list" of whitespace characters is to say "it depends on what you mean." If you mean "classic whitespace" it is probably the six characters listed above. If you want something more "modern" then it is the union of those six with all the characters from the Unicode category Zs. Then again, you might need to look within other blocks, too (e.g., U+1361 as mentioned in a comment to your question by Jerry Coffin). It also depends on what you intend to do with these space characters.

Now one last thing: Unicode doesn't have every character in the world yet; it keeps growing. It is possible that someday new space characters will be added. For now, category Zs + the classics are your best bet.


There are currently 25 Unicode whitespace characters with the following hexadecimal 'code points':

9, A, B, C, D, 20, 85, A0, 1680, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 200A, 2028, 2029, 202F, 205F, 3000 

Corresponding decimal values are:

9, 10, 11, 12, 13, 32, 133, 160, 5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287, 12288 

I originally acquired this information from Unicode.org, but my old link is no longer a working URL. Wikipedia has a nice page on the subject tho, at https://en.wikipedia.org/wiki/Whitespace_character if any are interested, which also gives 25 characters. (I have not cross-referenced that these characters are the same characters, but i trust that the Unicode Consortium has not made such a breaking, major change to their character set!)

I did find one simple page on unicode's website today, but it looks a bit more like a draft html page rather than anything supporting or claiming an official stance. But it does match what Unicode had previously posted as an official claim regarding what all of their whitespace characters are. (The link is in my comment below my answer.)


Comments

Popular posts from this blog

Are Regular VACUUM ANALYZE Still Recommended Under 9.1?

Can Feynman Diagrams Be Used To Represent Any Perturbation Theory?