txt2re: Tool to relieve headaches caused by Regexp (Regular Expressions)

Posted Nov 25th, 2008 by David Calhoun in javascript, regular expressions. 2 responses so far

Since I’m in the middle of transitioning from beginner Javascript to what I consider intermediate Javascript, I’ve often been faced with the intimidating task of using Regular Expressions.

It’s no easy task.  The syntax is ridiculously difficult to read and probably harder to write, at least for the uninitiated (i.e. normal people like me).  JavaScript guru Douglas Crockford agrees, as can be seen in part 4 of his video “The JavaScript Programming Language” and also in Chapter 7 of his book “JavaScript: The Good Parts”.

A LOT of people, including myself, learn best by doing.  That means a lot of trial-and-error.  And a good way to learn by doing is by using the great online tool txt2re.

The interface leaves a lot to be desired, but it’s still very useful.  For example, I can type the following string into a textbox on the site: Example “text”!

And I get this kind of bewildering and colorful output:

Notice my original text at the very top.  It took me a while to see it (due to the bad UI), but once I saw it, it started to make sense.  Notice that each character is contained in its own colored box.  By clicking the links below the characters, you can filter out specific stuff.  For instance, I can click on the quotation mark link ” below each quotation mark to filter.

txt2re will take my two new rules to filter out the quotation marks and spit back the Regex code in whatever language I so desire: Perl, PHP, Python, Java, Javascript, ColdFusion, C, C++, Ruby, VB, VBScript, J#.net, C#.net, C++.net, or VB.net.

I happen to only be interested in Javascript at the moment, so I click on the Javascript tab and get my code:

var txt='Example "text"!';

var re1='.*?';	// Non-greedy match on filler
var re2='(")';	// Any Single Character 1
var re3='.*?';	// Non-greedy match on filler
var re4='(")';	// Any Single Character 2

var p = new RegExp(re1+re2+re3+re4,"i");
var m = p.exec(txt);
if (m != null)	// exec() returns null when there is no match
{
	var c1=m[1];
	var c2=m[2];
	document.write("("+c1.replace(/</,"&lt;")+")"+"("+c2.replace(/</,"&lt;")+")"+"\n");
}

One of the things I especially love about this is that txt2re doesn’t cram the whole expression onto one line, which would make it infinitely harder to decode.  Instead it takes each rule you specified and leaves them separated (notice the re1, re2, re3, re4, in the code).

By leaving them separated AND commented, you can, in theory, go back later and manually change your Regexp if necessary.  Intuitively, keeping the rules separated increases maintainability.
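To make the maintainability point a bit more concrete, here is a minimal hand-written sketch of the same idea (it is not txt2re output).  The names filler, quote, quotedText and findQuoted are my own, and the extra rule that captures the text between the quotes is an assumed change, just to show that swapping one rule is a one-line edit:

```javascript
// Each rule lives on its own line, as in the txt2re output above.
var filler = '.*?';    // non-greedy match on filler
var quote  = '(")';    // a literal double quote, captured

// Joining the pieces gives the same pattern as re1+re2+re3+re4.
var separated = new RegExp(filler + quote + filler + quote, 'i');

// Assumed change: capture the text between the quotes instead of
// treating it as filler. Only one rule has to be swapped.
var quotedText = '(.*?)';
var extended = new RegExp(filler + quote + quotedText + quote, 'i');

function findQuoted(txt) {
  var m = extended.exec(txt);
  // exec() returns null when nothing matches, so check before indexing.
  return m === null ? null : m[2];
}
```

Calling findQuoted with the example string from earlier gives back the word between the quotation marks, and a string with no quotation marks gives null.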

Or if you want, you can put all the rules on one line.  Just try to maintain and modify this code without having it break!

Image courtesy of cackhanded, from Douglas Crockford’s presentation “The JavaScript Programming Language” (Part 4)

Video: Douglas Crockford, “Web Forward”

Posted Nov 18th, 2008 by David Calhoun in css, html, javascript. No responses yet
Check this out!  I was there!  I’m the guy fiddling around with the camera in the front row, half paying attention  :)

Make Wordpress more difficult to spam

Posted Oct 27th, 2008 by David Calhoun in wordpress. No responses yet

Spam spam spam and bacon spam spam and eggs spam spam spam and sausage spam

If you’ve ever had a blog, you likely know the pain and frustration of dealing with comment spammers. Especially if you have a Wordpress blog.  All of the following is designed to help you out with that.

History

I’ve had a Wordpress blog since 2004, and I found out pretty quickly that spammers flock like mad to try to post their links.  My first solution was to simply disable links entirely.  This worked at first, but then the spammers came back and found ways to get around these measures.

So the next solution was to try to get Wordpress to block certain words in comments.  For some reason, at the time there was some loophole in the software that still enabled users to make posts with these words, so I made a bit of a hack.  The problem was that spammers would keep coming up with all sorts of crazy names of magic medical pills and the sort, and I would have to go in and update my filter list every time.

Turning on moderation was something else I tried, but spam remained a problem, since a large chunk of my time was spent looking for legit comments in the mess of spam comments.

The Solution (err, solutions)!

Akismet

Akismet: saves you from the bastard spammers!

To a large degree, your comment spam woes can be solved by the positively awesome Wordpress plugin called Akismet.  There is a bit of setup involved, as you have to register over at Wordpress.com for your API key after installing.  But it’s definitely worth it.

Akismet definitely does a great job: it automatically identified 35,000+ spam comments (!) on my personal blog and let literally only one or two get through unflagged.  But I found that the spam robots don’t know when to stop.  They simply keep trying to post comments.  As a result, though your blog may be free from spam, your server is under constant strain from comment spammers.

At this point, things start to get a bit more creative…

On the assumption that most spam bots don’t have Javascript enabled, I dynamically added a hidden form field in /wp-content/themes/default/comments.php before </form>:

<?php

echo <<< HTML
<script type="text/javascript" charset="utf-8">
    document.write("<input type=\"hidden\" name=\"spamDetect\" value=\"JS\">")
</script>
<noscript>
    Due to the bastard spammers, you must now have JavaScript enabled in order to post.  Sorry!
</noscript>
HTML;

?>

Then I added the following to the very top of /wp-comments-post.php (after the <?php of course):

if (!isset($_POST['spamDetect']) ) {
    echo "You must have Javascript enabled to post comments.";
    exit;
}

I found that this pretty much eliminated spammers from showing up in my Akismet “Spam” tab, while still allowing normal people to post, and even post links!  The drawback, of course, is that people without Javascript are completely unable to post comments.

But now my server feels much better - it no longer has to process spam comments and insert them into MySQL.  The script simply exits before any of that happens.
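For what it’s worth, the same hidden field could also be added with DOM methods instead of document.write.  This is only a sketch of an alternative, not the code shown above: the function name addSpamDetectField is made up for illustration, the form id in the usage comment is an assumption, and the document object is passed in as doc so the function can be tried outside a browser.

```javascript
// Adds the hidden "spamDetect" field to a comment form.
// `doc` is the page's document object, passed in explicitly.
function addSpamDetectField(form, doc) {
  var input = doc.createElement('input');
  input.type = 'hidden';
  input.name = 'spamDetect';   // same field name the PHP check looks for
  input.value = 'JS';
  form.appendChild(input);
  return input;
}

// In a browser (the form id here is an assumption):
// addSpamDetectField(document.getElementById('commentform'), document);
```

The server-side check stays exactly the same, since it only looks for a posted field named spamDetect.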

reCAPTCHA: Since one CAPTCHA is simply just not good enough.

An alternative solution, but one I always find annoying from the user’s perspective, is CAPTCHA, specifically reCAPTCHA, which has its own Wordpress plugin.  It also has audio CAPTCHA support for non-Javascript users, which is a bit more considerate than my more drastic solution.

One more thing!  All this, save for Akismet, doesn’t stop trackback spam, which doesn’t use the traditional comment form.  So you’ll want to install the TrackBack Validator as well.

Summary

  1. Akismet is awesome for identifying spam.  Install it.  Use it.
  2. But spammers will still congest your server, so add more stuff to block them, like the custom code above.  And/or install reCAPTCHA.
  3. Install the TrackBack Validator to prevent comment trackback spam.

Getting RSS data is really simple…

Posted Oct 25th, 2008 by David Calhoun in rss, xml. 3 responses so far

As its name suggests, RSS (Really Simple Syndication) turns out to be just that - quite easy.  RSS is basically a simple XML feed, and as I have limited experience with XML, it’s always been a mystery to me how exactly you go about receiving and manipulating that data.

As it turns out, in PHP 5 it’s even easier than before.  We can essentially get RSS data (or any XML data, for that matter) with two lines of code, using functions that come with PHP by default:

$xml_string = file_get_contents('http://rss.cnn.com/rss/cnn_topstories.rss');
$xml_object = new SimpleXMLElement($xml_string);

And voila!  We now have the entire contents of the Top Stories RSS feed from CNN, all contained in $xml_object.

If we want to get the title of the top 2 stories, we simply do this:

$title1 = $xml_object->channel[0]->item[0]->title;
$title2 = $xml_object->channel[0]->item[1]->title;

The same can be done with the link and description fields, as in this example, where we grab a couple pieces of data to show information related to the top story:

$top_story = $xml_object->channel[0]->item[0];
$title = $top_story->title;
$link  = $top_story->link;
$description = $top_story->description;
echo <<< HTML
Top story on CNN:<br>
<a href="$link">$title</a><br>
$description<br>
HTML;

Then we have something that looks like this:

Notice that CNN puts some extra links at the bottom of the story description.

All this turns out to be ridiculously powerful, and I’m surprised I didn’t learn this sooner!
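For comparison, here is a rough sketch of the same lookup in browser JavaScript.  Note that this is a different technique from SimpleXMLElement: the feed text is parsed with DOMParser and the items are read with querySelector.  The function name topStories is hypothetical, and fetching the feed text itself is left out.

```javascript
// Returns title, link and description for the first `count` items
// of an already-parsed RSS document.
function topStories(xmlDoc, count) {
  var items = Array.prototype.slice.call(
    xmlDoc.querySelectorAll('channel > item'), 0, count);
  return items.map(function (item) {
    // Read the text of one child element, or '' if it is missing.
    function text(tag) {
      var node = item.querySelector(tag);
      return node ? node.textContent : '';
    }
    return {
      title: text('title'),
      link: text('link'),
      description: text('description')
    };
  });
}

// In a browser, where xmlString holds the feed text:
// var xmlDoc = new DOMParser().parseFromString(xmlString, 'application/xml');
// var top = topStories(xmlDoc, 2);
```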

Download the Code

Download the code and example.  <- the above code put into a function called xml_to_object, with error handling and all.

Wordpress: posting beautiful code

Posted Oct 24th, 2008 by David Calhoun in Uncategorized. No responses yet

The last entry I made has some really nicely formatted code, if I do say so myself.  It’s because I’ve discovered SyntaxHighlighter by Alex Gorbatchev.

The general version can be used on any HTML page and is described as 100% JavaScript-based.  The Wordpress plugin version is based on PHP but looks exactly the same.

All you do when posting code is write your code between these tags: [sourcecode language='css']..[/sourcecode]

(Note: type the tags out yourself rather than copying and pasting them.  Pasting carries over the formatting I applied to them, and with that formatting the code won’t render properly.  That formatting is the only way I was able to show the tags as an example here.)

The result is something like this:

#button {
	font-weight: bold;
	border: 2px solid #fff;
}

(via WordPress FAQ)

Update (October 28, 2008)

Wordpress chews up and removes tabs and spaces in your code!  To prevent this, you have to click on the “HTML” tab and surround your sourcecode tags with <pre> tags BEFORE you drop in your tabbed/spaced code.  This tells Wordpress that it shouldn’t get rid of those spaces/tabs that it apparently hates so much.

Character encoding woes breaking RSS

Posted Oct 24th, 2008 by David Calhoun in php, rss. No responses yet

This is a relatively easy fix, but it seems to be a pretty common problem.  It doesn’t seem like there’s a lot of understanding about character sets.  Or maybe that’s just my own lack of understanding  :P

The Problem

In any case - here’s the idea: I have a simple database setup for my WWI Flight Sim site.  Every time I post a news article, the article text gets run through PHP’s addslashes() and htmlentities().  After it’s inserted into the database, my site updates my frontpage cache (a static html file of the index on my site), and it also updates my sitemap and RSS feed.

The problem is that sometimes strange characters will show up in my RSS feed.  No worries, I figured, until these characters, particularly &amp;, actually broke the RSS feed entirely.  The result was that the source code of the feed could be read, but only a few RSS entries would be displayed.  The entries would stop displaying as soon as the feed reader encountered what it considered a horrendous character - &amp;!  Oh noes!

The solution?  Brush up on character encoding and figure out what exactly is happening.

The Solution!

As it turns out, by default htmlentities() converts to ISO-8859-1.  My RSS feed is in UTF-8.  Trouble ensues! At this point if I were really lazy I could just change my RSS feed to use ISO-8859-1 encoding.  But I like UTF-8.  So somehow we need to get the news article into UTF-8 encoding.

And we can do that in one line with the built-in PHP function mb_convert_encoding():

mb_convert_encoding($text, "UTF-8", "HTML-ENTITIES");

Where HTML-ENTITIES is the format I’m converting from, UTF-8 is the format I’m converting to, and $text is the text I want to encode.
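To illustrate what that conversion does, here is a rough JavaScript sketch that turns entity references back into the characters they stand for.  It is not how mb_convert_encoding() is implemented; the function name decodeEntities and the tiny lookup table are assumptions for illustration, and a real table would cover every named entity.

```javascript
// A deliberately tiny table of named entities (illustrative only).
var NAMED = { eacute: '\u00e9', uuml: '\u00fc', copy: '\u00a9', nbsp: '\u00a0' };

function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z0-9]+);/gi, function (whole, body) {
    if (body.charAt(0) === '#') {
      // Numeric reference: decimal (&#233;) or hexadecimal (&#xE9;).
      var hex = body.charAt(1).toLowerCase() === 'x';
      var code = parseInt(body.slice(hex ? 2 : 1), hex ? 16 : 10);
      // Leave out-of-range values untouched rather than throwing.
      return code > 0x10FFFF ? whole : String.fromCodePoint(code);
    }
    // Named reference: names missing from the table are left as they are.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : whole;
  });
}
```

With this table, both the named form and the numeric forms of é come out as the same single character, which is what the feed needs once it is served as UTF-8.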

Read More

Named HTML Entities in RSS

PHP: How to convert ISO character (HTMLEntities) to UTF-8?

Basic HTML template

Posted Oct 23rd, 2008 by David Calhoun in html. One response so far

This is basically for my own use, but others might find it useful as well:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
	<title></title>

	<meta name="keywords" content="" lang="en-us">
	<meta name="title" content="" lang="en-us">
	<meta name="description" content="" lang="en-us">
	<meta name="robots" content="all">
	<meta name="author" content="">
	<meta http-equiv="Content-Language" content="en-us">

	<link rel="stylesheet" type="text/css" charset="UTF-8" href="">
	<script type="text/javascript" charset="UTF-8" src=""></script>
</head>
<body lang="en-us">

</body>
</html>

And some rationale for ordering of elements:

  • Content-Type should be included before everything else in HEAD because this allows the browser to correctly interpret the following tags in the right encoding set
  • Title should be included right after the content-type tag, since it’s probably the most important part of the entire page.  It allows the user to determine if they want to look at the rest of the page.
  • CSS is usually more important than JavaScript, so it’s included first.  The user usually starts looking at the content first, which is styled by the CSS.  Usually JavaScript only takes effect when the user interacts with the page.  But the user usually interacts with the page only after viewing content, so the content (HTML) and styling of that content (CSS) take precedence.

Wider margin or padding issue in IE6

Posted Oct 13th, 2008 by David Calhoun in browser bugs, css. One response so far

Not sure why you’re getting wider margins or padding in IE6?  It’s a well-known issue with floating div elements, and it even has a specific name: the double margin (or double padding) bug.

The fix is simple.  Add this CSS rule to the floated divs:
display: inline;

As an aside, it’s interesting to note that this fix doesn’t work if you’ve forgotten to add a unit value to your margin or padding.  For instance, if you’ve typed “padding: 10″ instead of “padding: 10px”, IE6 still doubles the 10 and presumes the units are in pixels.

Via http://www.jaymeblackmon.com/ie6-double-margin-bug-fix and http://www.positioniseverything.net/explorer/floatIndent.html

Centering elements with CSS

Posted Oct 13th, 2008 by David Calhoun in css. No responses yet

If you’re just starting out with CSS, you’ve probably run into the problem of centering elements horizontally.  The common way to center an element within its parent container is by applying this:

margin: auto;

Or if you want to be more specific:

margin-right: auto;
margin-left: auto;

(which centers the element the same way as this, except that this also sets the top and bottom margins to 0:)

margin: 0 auto;

Still not centered?  Make sure the element is being displayed as a block:

display: block;

After all of this, is it still not centered?  Then you have to consider your browser itself.  Are you using a non-standard or outdated browser such as IE6 or older?  If you’re viewing on IE6, then your page may be rendering in quirks mode (especially if your webpage has no DOCTYPE declaration), which renders the page the way IE5 would.  In this case, you will need to add this to the parent container to correctly center the element:

text-align: center;

Yahoo Frontend Engineering Conference 2008

Posted Oct 10th, 2008 by David Calhoun in Uncategorized. No responses yet

The conference was definitely interesting and well worth the drive from Southern California!  I finally got to see a lot of familiar names in person.  Douglas Crockford was there of course, and he didn’t have to worry about making people laugh at the right time during his keynote.  When Douglas wants you to laugh, you laugh (apparently).

I would love to have presented something, but I had nothing to show off, and being an intern, I can’t do much.  I’m lucky I got to attend, I suppose!  Everyone who signed up got Crockford’s “JavaScript: The Good Parts” as well as a conference T-shirt and sticker!

There was also a raffle for assorted books.  I didn’t win, but James did - and he graciously gave me his books.  Both were written by fellow Yahoos: a book on Javascript something or other by Nicholas Zakas and another book by Christian Heilmann.  I saw Zakas give a presentation during the conference, and I was actually taught one-on-one by him in a course on the DOM and YUI that ran a couple of days, which was part of the Yahoo Juku program.

Photos to come.  Right now I’m tired from the drive.  Took off early around 3pm and got to my house at 9pm - drove straight through without a break.  And now it’s somehow 1am.
