=begin rant

I absolutely hate all this encoding stuff in ruby 1.9, and I'll try to explain why here.

* As a programmer, the most important thing for me is to be able to reason about the code I write. Reasoning tells me whether the code I write is likely to run, terminate, and give the result I want. In ruby 1.8, if I write an expression like "s3 = s1 + s2", where s1 and s2 are strings, this is easy because it's a one-dimensional space.

      s3     =     s1     +     s2

    ----->       ----->       ----->
    string       string       string

  As long as s1 and s2 are strings, then I know that s3 will be a string, consisting of the bytes from s1 followed by the bytes from s2. End of story, move to the next line. But in ruby 1.9, it becomes a multi-dimensional problem:

      s3     =     s1     +     s2

    enc^         enc^         enc^
     |            |            |
     |            |            |
     +---->       +---->       +---->
      string       string       string

  The number of possibilities now explodes. What are the possible encodings that s1 might have at this point in the program? What are the possible encodings that s2 might have at this point? Are they compatible, or will an exception be raised? What encoding will s3 have going forward in the next line of the program?

  The reasoning is made even harder because the *content* of the strings is also a dimension in this logic. s1 and s2 might have different encodings but could still be compatible, depending on whether they are empty or consist only of 7-bit characters, as well as whether they are tagged with an ASCII-compatible encoding. The encoding of s3 also depends on all these factors, including whether s1 is empty or s2 is empty.

  Analysing a multi-line program then multiplies this further, as you need to carry forward this additional state in your head to where it is next used.
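  To make this concrete, here is a rough irb-style sketch of that content-dependence (this is my reading of the 1.9 rules; the values and encodings are invented purely for illustration):

      a = "hello".force_encoding("UTF-8")
      b = "world".force_encoding("ISO-8859-1")
      (a + b).encoding    # => #<Encoding:UTF-8>   both happen to be 7-bit, so no error

      c = "caf\xC3\xA9".force_encoding("UTF-8")      # non-ASCII UTF-8 content
      d = "caf\xE9".force_encoding("ISO-8859-1")     # non-ASCII Latin-1 content
      c + d               # raises Encoding::CompatibilityError

      ("".force_encoding("UTF-8") + d).encoding
                          # => #<Encoding:ISO-8859-1>   because the receiver is empty

  Same expression, same encodings; whether it raises, and what encoding you get back, turns on the bytes that happen to be in the strings at run time.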
* Now try reasoning about a program which makes use of strings returned by library functions (core or third party), where those functions almost never document what encoding the string will be tagged with. You need to guess or test what encoding you get, and/or reason that the encoding actually doesn't matter at this point in the program, because of what you know about the encodings of other strings it will be combined with.

* Whether or not you can reason about whether your program works, you will want to test it. 'Unit testing' is generally done by running the code with some representative inputs, and checking if the output is what you expect. Again, with 1.8 and the simple line above, this was easy: give it any two strings and you have sufficient test coverage. With 1.9, there is an explosion of test cases if you want proper coverage: the number of different encodings and string contents (empty/ascii/non-ascii) you expect to see for s1, multiplied by the same for s2, plus testing the encoding of the result.

* Unless you take explicit steps to avoid it, the behaviour of a program under ruby 1.9 may vary depending on what system it is run on. That is, the *same* program running with the exact *same* data and the *same* version of ruby can behave *differently* on different systems, even crashing on one where it worked on the other. This is because, by default, ruby uses values from the environment to set the encodings of strings read from files. It's possible to override ruby's policy on this, but it requires remembering to use some incantations (the ones I mean are sketched after this list). If you accidentally omit one, the program may still work on your system but not on someone else's.

* It's ridiculously complicated. string19.rb contains around 200 examples of behaviour, and could form the basis of a small book. It's a +String+, for crying out loud! What other language requires you to understand this level of complexity just to work with strings?! The behaviour is full of arbitrary rules and inconsistencies: for example, /abc/ is tagged with encoding US-ASCII whilst "abc" gets the source encoding, and some string methods raise exceptions on invalid encodings whilst others do not (both are demonstrated in a sketch after this list).

* It's buggy as hell. I found loads of bugs just in the process of documenting this. To me this could imply two things:

  - even Ruby's creators, who are extremely bright people, don't understand their own rules sufficiently to implement them properly. In that case, what chance do the rest of us have?

  - very few people are actually using this functionality, in which case, what's it doing as a core part of the language?

* Of course, it's very hard to categorise something as a "bug" if you don't know what the intended behaviour is. Almost all the behaviour given in string19.rb is undocumented. By that I mean: when I look at the documentation for String#+, I expect at minimum to be told what it requires for valid input (i.e. under what circumstances it will raise an exception), and what the properties of the result are. If Ruby has any aspirations of becoming a standardised language, the ISO and ANSI committees will laugh it down the corridor until all this is formally specified (and what I have written here doesn't come close).

* Even when I explicitly tag an object as "BINARY", Ruby tells me it's "ASCII-8BIT". This may seem like a minor issue, but it annoys me intensely to be contradicted by the language like this, when it is so blatantly wrong. All text is data; the converse is *not* true.

* It solves a non-problem: how to write a program which can juggle multiple string segments all in different encodings simultaneously. How many programs do you write like that? And if you do, can't you just have a wrapper object which holds the string and its encoding?

* It's pretty much obsolete, given that the whole world is moving to UTF-8 anyway. All a programming language needs is to let you handle UTF-8 and binary data, and for non-UTF-8 data you can transcode at the boundary. For stateful encodings you have to do this anyway.

* It's half-baked. You can't convert between uppercase and lowercase outside of ASCII, and you can't compare strings using the UCA (Unicode Collation Algorithm). So anyone doing serious Unicode stuff is still going to need an external library.

* It's ill-conceived. Knowing the encoding is sufficient to pick characters out of a string, but other operations (such as collation) depend on the locale. And in any case, the encoding and/or locale information is often carried out-of-band (think: HTTP; MIME E-mail; ASN1 tags), or within the string content itself (think: the encoding attribute of an XML declaration).

* It's too stateful. If someone passes you a string, and you need to make it compatible with some other string (e.g. to concatenate it), then you need to force its encoding. That's impolite to the caller, as you've mutated the object they passed; furthermore, it won't work at all if they passed you a frozen string. So to do this properly, you really have to dup the string you're being passed, which needlessly copies the entire content.

      # ruby 1.8
      def append(str)
        @buf << str
      end

      # ruby 1.9
      def append(str)
        @buf << str.dup.force_encoding("ASCII-8BIT")
      end
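The "incantations" mentioned in the point about system-dependent behaviour, as a rough sketch (the file names are made up; these are the 1.9 spellings as I understand them):

      # Pin down the encoding per call, instead of inheriting it from the locale:
      text = File.open("input.txt", "r:UTF-8") { |f| f.read }   # tagged UTF-8
      blob = File.open("input.bin", "rb")      { |f| f.read }   # tagged ASCII-8BIT

      # Or override the process-wide default that would otherwise be taken from
      # the environment (LANG/LC_ALL and friends):
      Encoding.default_external = Encoding::UTF_8

Forget one of these on some code path, and the program silently goes back to trusting the locale of whatever machine it happens to be running on.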
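The inconsistencies mentioned in the "ridiculously complicated" point, shown irb-style (assuming a source file whose source encoding is UTF-8; this is my recollection of the 1.9 behaviour):

      /abc/.encoding      # => #<Encoding:US-ASCII>
      "abc".encoding      # => #<Encoding:UTF-8>   (the source encoding)

      s = "\xff\xff".force_encoding("UTF-8")
      s.valid_encoding?   # => false
      s.length            # => 2     no complaint here...
      s.gsub(/x/, "y")    # ...but this raises ArgumentError (invalid byte sequence)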
However I am quite possibly alone in my opinion. Whenever this pops up on ruby-talk and I speak out against it, there are two or three others who speak out equally vociferously in favour. They tell me I am doing the community a disservice by warning people away from 1.9. The remainder are silent, apart from the occasional comment along the lines of "I wish this encoding stuff was optional."

I will now try very hard to find something positive to say about all this.

* You can write programs to truncate a string to N characters, e.g.

      if str.size > 50
        str = str[0,47] + "..."
      end

  I can only think of one occasion where I've ever had to do this. Maybe other people do this all the time.

* You can write regular expressions to match against UTF-8 strings. Of course, ruby 1.8 can do that too, by the much simpler approach of tagging the regexp as UTF-8, rather than every other string object in the system (a rough 1.8 sketch follows this list).

* I can see how it might appeal to be able to write programs in non-Roman scripts. However this is rather defeated by the fact that constants must start with a capital 'A' to 'Z'.

* Erm, that's all I can think of at the moment.
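For completeness, here is roughly what the truncation and regexp points above look like in 1.8, where the UTF-8 tag lives on the regexp rather than on every string in the system (a rough sketch of the idea, not a tested snippet):

      # ruby 1.8: the /u flag makes the regexp UTF-8 aware; str stays a byte string
      chars = str.scan(/./mu)                       # one element per UTF-8 character
      str = chars[0, 47].join + "..." if chars.size > 50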