Raising URI::InvalidURIError from a perfectly valid URI

I was recently puzzled by URI::parse raising a URI::InvalidURIError on a perfectly well-formed URI.

[ruby]
URI::InvalidURIError: bad URI(is not URI?): http://practicalguile.com/articles?query=latest
from /opt/local/lib/ruby/1.8/uri/common.rb:436:in `split'
from /opt/local/lib/ruby/1.8/uri/common.rb:485:in `parse'
from (irb):2
from :0
[/ruby]

What’s not apparent from this exception message is that the URL contained a trailing space, which caused URI.parse to fail. The following specifications demonstrate that leading or trailing whitespace triggers this particular exception.

uri.spec.rb
[ruby]
require 'rubygems'
require 'spec'
require 'uri'

describe URI do
  it "should raise an InvalidURIError with leading whitespace in url" do
    lambda { URI.parse(' http://www.ruby-lang.org') }.should raise_error(URI::InvalidURIError)
  end

  it "should raise an InvalidURIError with trailing whitespace in url" do
    lambda { URI.parse('http://www.ruby-lang.org ') }.should raise_error(URI::InvalidURIError)
  end
end
[/ruby]

Running the spec will get you the result below.

ruby uri.spec.rb

..Finished in 0.030051 seconds

2 examples, 0 failures
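
Since the trailing space is invisible in the exception message, one way to make it show up is to print the offending string with inspect. The snippet below is a minimal sketch: parse_failure_report is a hypothetical helper name (not part of the URI library), and it assumes the Ruby in use rejects surrounding whitespace the way 1.8 does here.

```ruby
require 'uri'

# Hypothetical helper, not part of the URI library.
# Returns nil when the string parses, otherwise a report in which
# String#inspect quotes the value so stray whitespace becomes visible.
def parse_failure_report(url)
  URI.parse(url)
  nil
rescue URI::InvalidURIError => e
  "could not parse #{url.inspect} (#{e.class})"
end

puts parse_failure_report('http://practicalguile.com/articles?query=latest ')
```

With the quotes around the value, the space before the closing quote is hard to miss.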

Looking at the stack trace in the exception, it’s being raised by URI.split after URI.parse is invoked with the offending URL.

RUBY_INSTALL/1.8/uri/common.rb

[ruby]
def self.parse(uri)
  scheme, userinfo, host, port,
    registry, path, opaque, query, fragment = self.split(uri)

  if scheme && @@schemes.include?(scheme.upcase)
    @@schemes[scheme.upcase].new(scheme, userinfo, host, port,
                                 registry, path, opaque, query,
                                 fragment)
  else
    Generic.new(scheme, userinfo, host, port,
                registry, path, opaque, query,
                fragment)
  end
end
[/ruby]

Nothing weird is happening in URI.parse; it’s a straightforward call to URI.split. So I’ll go into URI.split, with comments removed for brevity.

[ruby]
def self.split(uri)
  case uri
  when ''
  when ABS_URI
    scheme, opaque, userinfo, host, port,
      registry, path, query, fragment = $~[1..-1]

    if !scheme
      raise InvalidURIError,
        "bad URI(absolute but no scheme): #{uri}"
    end
    if !opaque && (!path && (!host && !registry))
      raise InvalidURIError,
        "bad URI(absolute but no path): #{uri}"
    end
  when REL_URI
    scheme = nil
    opaque = nil

    userinfo, host, port, registry,
      rel_segment, abs_path, query, fragment = $~[1..-1]
    if rel_segment && abs_path
      path = rel_segment + abs_path
    elsif rel_segment
      path = rel_segment
    elsif abs_path
      path = abs_path
    end
  else
    raise InvalidURIError, "bad URI(is not URI?): #{uri}"
  end

  path = '' if !path && !opaque # (see RFC2396 Section 5.2)
  ret = [
    scheme,
    userinfo, host, port, # X
    registry,             # X
    path,                 # Y
    opaque,               # Y
    query,
    fragment
  ]
  return ret
end
[/ruby]

URI.split matches the incoming URL against an empty string and against the regular expressions for absolute and relative URIs. As the specifications earlier show, URLs with leading or trailing whitespace match none of these, so the case statement falls through to the else branch and raises InvalidURIError with the rather misleading message.
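
The fall-through can be reproduced in miniature. The sketch below mirrors the shape of URI.split's case statement, but TOY_ABS, TOY_REL and classify are made-up stand-ins for illustration and are far simpler than the real patterns.

```ruby
# Toy stand-ins for ABS_URI and REL_URI (assumptions for illustration only).
TOY_ABS = /^([a-z]+):\/\/(\S+)$/
TOY_REL = /^(\/\S*)$/

# Same structure as URI.split: try the empty string, then each anchored
# regex, and give up in the else branch when nothing matches.
def classify(str)
  case str
  when ''      then :empty
  when TOY_ABS then [:absolute, *$~[1..-1]]   # $~ holds the last match
  when TOY_REL then [:relative, *$~[1..-1]]
  else :no_match
  end
end

p classify('http://www.ruby-lang.org')   # => [:absolute, "http", "www.ruby-lang.org"]
p classify(' http://www.ruby-lang.org')  # => :no_match
p classify('http://www.ruby-lang.org ')  # => :no_match
```

A single space at either end is enough to push the string past every when clause.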

The regexes used for matching absolute and relative URIs are shown below, if you really want to know.
[ruby]
require 'uri'
include URI::REGEXP

ABS_URI
/^
([a-zA-Z][-+.a-zA-Z\d]*): (?# 1: scheme)
(?:
((?:[-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*) (?# 2: opaque)
|
(?:(?:
\/\/(?:
(?:(?:((?:[-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)? (?# 3: userinfo)
(?:((?:(?:(?:[a-zA-Z\d](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.?|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))(?::(\d*))?))?(?# 4: host, 5: port)
|
((?:[-_.!~*'()a-zA-Z\d$,;+@&=+]|%[a-fA-F\d]{2})+) (?# 6: registry)
)
|
(?!\/\/)) (?# XXX: '\/\/' is the mark for hostport)
(\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)? (?# 7: path)
)(?:\?((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 8: query)
)
(?:\#((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 9: fragment)
$/xn

REL_URI
/^
(?:
(?:
\/\/
(?:
(?:((?:[-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)? (?# 1: userinfo)
((?:(?:(?:[a-zA-Z\d](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.?|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))?(?::(\d*))? (?# 2: host, 3: port)
|
((?:[-_.!~*'()a-zA-Z\d$,;+@&=+]|%[a-fA-F\d]{2})+) (?# 4: registry)
)
)
|
((?:[-_.!~*'()a-zA-Z\d;@&=+$,]|%[a-fA-F\d]{2})+) (?# 5: rel_segment)
)?
(\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)? (?# 6: abs_path)
(?:\?((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 7: query)
(?:\#((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 8: fragment)
$/xn
[/ruby]

Looks rather intimidating, doesn’t it? However, we’re more interested in the beginning and end of the regular expressions, so it’s safe to ignore all the stuff in between. Narrowing our focus to the regex anchors (^ and $), we can see that nothing between the anchors and the URI pattern itself allows for whitespace, which prevents an otherwise valid URI with surrounding whitespace from being matched in URI.split.

This all means that URI.split has an undocumented precondition: the uri parameter must be stripped of any surrounding whitespace.
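
Given that precondition, the simplest workaround is to strip the string before parsing it. The following is only a sketch of that idea; parse_stripped is a hypothetical wrapper name, not something the URI library provides.

```ruby
require 'uri'

# Hypothetical wrapper, not part of the URI library: remove leading and
# trailing whitespace before handing the string to URI.parse.
def parse_stripped(url)
  URI.parse(url.to_s.strip)
end

uri = parse_stripped('http://practicalguile.com/articles?query=latest ')
puts uri.host   # practicalguile.com
puts uri.query  # query=latest
```

Stripping only handles whitespace at the edges; a space in the middle of the URL would still need to be escaped before parsing.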
