Character encoding woes breaking RSS

Posted Oct 24th, 2008 by David Calhoun in php, rss

This is a relatively easy fix, but it seems to be a pretty common problem.  It doesn’t seem like there’s a lot of understanding about character sets.  Or maybe that’s just my own lack of understanding  :P

The Problem

In any case - here’s the idea: I have a simple database setup for my WWI Flight Sim site.  Every time I post a news article, the article text gets run through PHP’s addslashes() and htmlentities().  After it’s inserted into the database, my site updates my frontpage cache (a static html file of the index on my site), and it also updates my sitemap and RSS feed.

The problem is that sometimes strange characters will show up in my RSS feed.  No worries, I figured, until these characters, particularly &, actually broke the RSS feed entirely.  The source code of the feed could still be read, but only a few RSS entries would be displayed.  The entries would stop displaying as soon as the feed reader hit what it considered a horrendous character - &!  Oh noes!

The solution?  Brush up on character encoding and figure out what exactly is happening.
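
One way to see what is happening is to feed small snippets to an XML parser and check which ones it accepts.  The sketch below assumes PHP 7 or later with the SimpleXML extension enabled; is_well_formed_xml() is a made-up helper name, not part of my site's code.  XML only predefines five entities (&amp;, &lt;, &gt;, &quot; and &apos;), so a named HTML entity or a bare & should be rejected, while a numeric reference should pass.

```php
<?php
// Hypothetical helper: returns true when $xml parses as well-formed XML.
function is_well_formed_xml(string $xml): bool
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc !== false;
}

// Named HTML entity: not defined in plain XML.
var_dump(is_well_formed_xml('<title>caf&eacute; news</title>'));  // bool(false)
// Bare ampersand: starts an entity reference that never completes.
var_dump(is_well_formed_xml('<title>Tom & Jerry</title>'));       // bool(false)
// Numeric reference plus &amp;: both are valid XML.
var_dump(is_well_formed_xml('<title>caf&#233; &amp; bar</title>')); // bool(true)
```

A strict feed reader behaves much like this parser, which would explain why entries stop showing up at the first offending character.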

The Solution!

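As it turns out, by default htmlentities() assumes its input is ISO-8859-1 (at least in PHP versions before 5.4) and turns every character it can into a named HTML entity such as &eacute;.  My RSS feed is UTF-8 XML, and XML only knows five named entities out of the box: &amp;, &lt;, &gt;, &quot; and &apos;.  Trouble ensues!  At this point, if I were really lazy, I could just change my RSS feed to use ISO-8859-1 encoding.  But I like UTF-8.  So somehow we need to turn the entity-encoded article text back into plain UTF-8.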

And we can do that in one line with the built-in PHP function mb_convert_encoding():

$text = mb_convert_encoding($text, "UTF-8", "HTML-ENTITIES");

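Here HTML-ENTITIES is the format I’m converting from, UTF-8 is the format I’m converting to, and $text is the text I want to convert.  The function returns the converted string rather than changing $text in place, so the result has to be assigned or output.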
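
One caveat for newer PHP versions: the "HTML-ENTITIES" pseudo-encoding in mbstring is deprecated as of PHP 8.2.  A rough alternative, assuming PHP 5.4 or later for the ENT_HTML401 and ENT_XML1 flags, is to decode with html_entity_decode() and then re-escape only what XML reserves with htmlspecialchars().  The helper name rss_safe_text() is made up for this sketch.

```php
<?php
// Hypothetical helper: turn entity-encoded text from the database
// into UTF-8 that is safe to drop into an RSS element.
function rss_safe_text(string $stored): string
{
    // Decode named and numeric HTML entities into real UTF-8 characters.
    $utf8 = html_entity_decode($stored, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Re-escape only the characters XML reserves (&, <, >, quotes).
    return htmlspecialchars($utf8, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo rss_safe_text('caf&eacute; &amp; cr&egrave;me'), "\n";
// café &amp; crème
```

The second step matters with either approach: once the entities are decoded, any literal & or < in the text still has to be escaped (or wrapped in a CDATA section) before it goes into the feed.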

Read More

Named HTML Entities in RSS

PHP: How to convert ISO character (HTMLEntities) to UTF-8?
