Archive for the ‘XML’ category

MySQL conference – xaware session

April 15th, 2008

Xaware.org looks pretty slick.  I’m in a session with one of the xaware guys going through the process of using xaware.

OK – we had a long intro, but are finally seeing some screenshots, which look useful.  Xaware will bring value to scenarios with many complex XML documents.  The fewer or less complex your XML files are, the less useful Xaware will be (from the presenter).

I’m not 100% sure why this is necessarily more useful than accessing data directly via XML in Groovy (println results.users.user.name.first, for example).  Certainly having a graphic tool to manage the mappings would be useful, but I’m not sure this is a requirement.  Well, nothing is a requirement, I guess.  I think 3-4 years ago I’d have jumped all over this, but I’m not sure it’s targeted at me now.  However, I’m not working on projects with numerous disparate XML data sources.  If I were, perhaps I might see more use.
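
For comparison, here is a rough sketch of that same kind of direct path access in Python rather than Groovy, using only the standard library.  The sample document and the first_names helper are made up for illustration; the only thing taken from the text above is the results/users/user/name/first shape.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, shaped to match the results.users.user.name.first path.
SAMPLE = """
<results>
  <users>
    <user><name><first>Ada</first><last>Lovelace</last></name></user>
    <user><name><first>Alan</first><last>Turing</last></name></user>
  </users>
</results>
"""


def first_names(xml_text):
    """Return every users/user/name/first value, in document order."""
    root = ET.fromstring(xml_text.strip())  # root is the <results> element
    return [el.text for el in root.findall("./users/user/name/first")]


if __name__ == "__main__":
    print(first_names(SAMPLE))
```

For one or two simple documents this is about all the "mapping" needed, which is roughly the point of the paragraph above.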

I wonder if this is 2 way?  Can I publish data to a java data source (exposed via xaware) that would WRITE or MODIFY XML for me?  Don’t know yet.

The demo has been interesting, but the session as a whole didn’t quite match up to the session name = “taming your messy mashups”.  I was expecting more code, not a tool.  It’s not bad, but not what I was expecting.   Watching a visual mapping between COBOL file data and XML structure was interesting.

Still, xaware.org might be a tool you’d be interested in using for XML mapping and reading.

Grails and Amazon web services – not possible?

October 6th, 2007

As of this writing, I’m having a devil of a time getting Grails (0.6 and 1.0-RC1 SNAPSHOT) to interact with the Amazon web services, uhh, service.  After messing around trying to use the GroovySOAP stuff, which I couldn’t get to work (maybe I’ll go back to that?) I found the Amazon Web Services library for Java.  “OK,” I thought.  “I’ll just drop that in to the /lib directory, import the com.amazonaws stuff and be on my way.”  But it wasn’t to be.

I’m stuck on a seemingly insurmountable error.  Apparently only one other person on the planet is having this problem, which doesn’t help me at all.  I’m using Linux, he’s using Windows, both using JDK1.6 versions.  I’ve tried Grails 0.6 and 1.0-RC1, both with groovy 1.0 and groovy 1.1 beta 3.  No luck in any case.  Here’s the error text I’m getting back:

Message: loader constraint violation: when resolving interface method “javax.xml.bind.Unmarshaller.unmarshal(Ljavax/xml/transform/Source;)Ljava/lang/Object;” the class loader (instance of org/codehaus/groovy/tools/RootLoader) of the current class, com/amazonaws/ecs/query/AmazonECSQuery, and the class loader (instance of <bootloader>) for resolved class, javax/xml/bind/Unmarshaller, have different Class objects for the type javax/xml/transform/Source used in the signature
Caused by: java.lang.LinkageError: loader constraint violation: when resolving interface method “javax.xml.bind.Unmarshaller.unmarshal(Ljavax/xml/transform/Source;)Ljava/lang/Object;” the class loader (instance of org/codehaus/groovy/tools/RootLoader) of the current class, com/amazonaws/ecs/query/AmazonECSQuery, and the class loader (instance of <bootloader>) for resolved class, javax/xml/bind/Unmarshaller, have different Class objects for the type javax/xml/transform/Source used in the signature 

My limited reading of this and the little source code I could track down leads me to believe this is wholly insurmountable with the current setup.  By that I mean this is not a configuration thing, but is an issue deep within the core of either groovy, grails or the Amazon system.  I suspect that the groovy or grails team would need to budge on this if it’s to be addressed.  I don’t see Amazon making accommodations for the few edge cases out there running Grails.  So much for the notion that you can seamlessly drop down to Java whenever you need to.  :(

DRM is good and necessary

August 12th, 2007

for the social web to evolve to the next level.  Is that at all controversial?  I hope at least the title is, as I’d like to provoke a bit of thought in you, the reader, about the topic of DRM.

I’ve been mulling this and related topics for some time, but not quite in these words.  This morning the connection between what I’ve been thinking about and what’s commonly known as “DRM” jumped out at me, and I wanted to elaborate a bit more.  This is intended both to help me flesh out my thinking on this as well as perhaps get some feedback from the community.

I’ve always been afraid to put too much on line, especially in this blog.  Once I started publishing anything online, I was very, perhaps overly, aware of the possibility of anyone reading it.  Issues like looking for a new job were things that I couldn’t write about because my coworkers might read about it.  Financial issues were not something I could write about because, well, they tend to be somewhat personal.  Family and health issues were also pretty much off the table.  While I would have benefited from writing about each topic, writing them all with the ‘same’ identity would have made too much information about me available to too many people.

Keeping separate blogs with different identities is one way of coping with this multiple identity issue.  Using separate user accounts and participating in different forums is another way.  Both have their drawbacks – the complexity and confusion of having to use multiple systems are primary concerns, but I’m sure you can think of some other wrinkles in there as well.

This got me to thinking about the control we have over our own content on the internet.  The current model is that end users contribute actual content – text, images, video, etc. – to discrete servers under our chosen identities.  These central services act as aggregators of the content.  Once something is out there, it’s out there.  There are certain barriers which can be put up which will prevent people from accessing some of that content – forums can be closed or access-limited, for example.  We’ve still no good way to create content and control its distribution at a granular level, nor any way to revoke content once it’s been published.

I realize many people will continue to have this view that “it’s the internet, if you publish it, it’s out there forever”.  Google’s cache, archive.org and other developments have ingrained this “write once, live with it forever” attitude in an entire generation of people.  I’m not suggesting that those services are a bad thing, or that the concept of content being around “forever” is necessarily bad either.  I *am* suggesting that some information shouldn’t fall under that umbrella – content has different meaning based on who is writing it, who the intended audience is, who the actual audience is, and so on.   I am also suggesting that the concept of centralized ‘one time’ publishing and archiving of information is something which is having a suppressing effect on the amount of content created, shared and consumed on the internet.

What are some of the controls that we can exert over our information as it’s published right now?  Consider a tech geek who runs their own blog or community on their own server.  This is someone who embodies all that is possible in terms of ‘control’ over their own information on the internet.  This person can choose to make their information available to the public at large, or only to a select group of people, via registration/invitation.  If the information is to the public at large, a ‘robots.txt’ file is available to let well-behaved search engine crawlers know what they can index (ignoring the non-well-behaved for this discussion).  Once it’s indexed, our hero has a devil of a time getting it ‘unindexed’.  Google has an ‘immediate’ page removal tool, but that is something which still operates on pages.  You need to serve up a 404 page for the googlebot, but keep the page ‘open’ to the rest of your visitors if your intention was to truly ‘unindex’ the URL, rather than remove it.   How or if other search crawlers offer these sorts of services is beyond the scope of this post.  The point I’m trying to make is that it’s rather difficult and complicated, and that’s for people who have control over their entire publishing mechanism.
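
To make the robots.txt part concrete, here is a small sketch of how a well-behaved crawler might consult such a file, using Python’s standard urllib.robotparser module.  The rules, the ‘examplebot’ agent name and the example.com URLs are all invented for illustration.  Note that this only covers what a polite crawler will fetch from now on; it does nothing about pages that have already been indexed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: everything is open except the /private/ section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, agent, url):
    """Return True if a crawler honouring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory text, no network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/blog/post-1",
                "http://example.com/private/notes.html"):
        print(url, allowed(ROBOTS_TXT, "examplebot", url))
```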

For people who simply post in hosted content services (blogs, forums, etc.) the control over content is extremely limited.  That’s been the nature of the beast so far, and it’s worked reasonably well, but there seems to be quite a lot lacking in my own ability to control what I’ve said and where it’s been republished/syndicated/etc.  Perhaps the ‘what I’ve said’ issue shouldn’t be able to be modified.  After all, even in the real world, rarely do we let people go back and revise their content (excepting George Lucas’ ability to revise  “Star Wars” ad infinitum).   But who the content gets distributed to, and perhaps how much of that content they receive, is something we’ve had more experience with over the past several years, primarily in the music and movie arena.

The notion of DRM – Digital Rights Management – software controlling what you can and can’t do with something received (usually purchased) isn’t really all that new.  Back in my day, C64 disks were ‘copy protected’.  If you used the product as intended, it worked.  If you tried to use a generic disk backup utility, the drive knocked about (and could break) because the publisher had modified the disk format such that ‘ordinary’ utilities couldn’t read the disk contents, which would prevent copies.  Mr. Nibble got around this by writing new disk copy programs which bypassed that built-in reading, and then publishers pushed back with even harder-to-crack protection.  This arms race eventually wound down, and copy protection, at least at the hardware level, seemed to fade for a while.

But it’s come back with a vengeance, and the stakes are much higher.  Copy protection – DRM – is a basic part of how most music and videos are distributed.  The software players will decode the bits and give you the music only if conditions embedded in the music directly ‘allow’ the player to do so.  Have you paid your license this month for your Yahoo! music subscription?  If not, your player won’t play.  Time-limited DRM is big with Yahoo and Microsoft, who offer ‘all you can eat’ subscription pricing.  Apple’s DRM is not time-sensitive, but hardware sensitive.  Your purchased tracks can only be transferred to X number of computers, and you can only burn a track collection Y number of times.  These limits are high enough that most people aren’t affected with average use, just like the monthly pricing is set low enough to not be a burden to most people.  But the concept is still in there – the content owner still has a say in how you use the content, and they have technical means to prevent you from taking certain actions.

Contrast this with content you create and publish on the web in the form of images, music, videos and text.  The average user has no control over how their information is used once it’s “out there”.  Yes, we have copyright laws, but tracking down violators and enforcing the laws is often not worth the effort, mostly because the effort is so time consuming.

There’s been a move toward incorporating restrictions in content creation tools, albeit at a somewhat coarse level, in networks like facebook.   Facebook has the idea of controlling which pieces of information are shown to specific sets of people (‘my friends’, ‘my groups’, etc.).  While this idea is a step in the right direction, it’s nowhere near as fine-grained as it should or could be.

As I’ve been writing this entry, I’ve stopped a few times (errands to run and such), and already my thinking has changed a bit since this morning’s view.  What I’m now envisioning is content creation that would allow marking up various segments of the content with permission levels.  Delivery of content can be handled much as most web content is delivered today.  When served up by the server, an authenticated user would get access to “extra” layers in the content.

This seems similar to the old RealPlayer idea of a stream being created once, but having multiple levels of quality built in to it – the player and server negotiated the level of quality, and the server would serve up the higher quality sections of the file if the player could handle it.  If not, the lower quality portions of the file were streamed down. This wouldn’t necessarily work in a world where people access most data directly (or, with only one layer of software in between – the general purpose browser).  My scheme would require an extra or different layer of software to request the content with the necessary authentication protocol in place.  I’m envisioning this being handled more between agents on behalf of users – perhaps the next generation of RSS readers with identity management built in.  Ideally the software would also respect caching and timeout headers, to help deal with ‘clearing’ out of content which the original author no longer wants around.  I completely understand that something like this depends on the receiving software honoring that sort of request, and it could just as easily ignore it.  Once you have the content, you have the content, right?  While technically true, our general web browsers have the notion of content caching built in, and we don’t generally worry about that too much.  Nothing will give total control, but a decent balance between the wishes of the author and the desires of the consumer would be more closely achieved with this sort of approach.

So, after another half hour or so away from this, this idea is turning in to more of a wish for three things:

  • Multi-layered content creation tools which respect identity levels
  • Identity authentication and negotiation at the content serving level
  • Identity management and negotiation at the content consuming level (RSS readers would be a good start)

OK, so it’s not *necessary*, but would certainly be useful. For the identity negotiation aspect to work, I’m thinking that the openid project has a good approach, and incorporating that openid practice would be a good direction to head in.

When an agent requests a piece of content, the server response can include embedded information which indicates a more complete version is available, with links to request the more complete version(s).  Any request for this information would require authentication (via openid).  During this authentication process, if the user/agent is unknown, the original author would be notified of a pending request, the requester’s information, and the option to grant access to the information or not.
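
Here is a very rough sketch of the layered-content idea in Python, just to show the shape of it.  Everything in it is hypothetical: the three audience levels, the Segment type and the render function are names made up for this example, and the OpenID authentication step is assumed to have already happened and produced a viewer level.

```python
from dataclasses import dataclass

# Hypothetical audience levels; a higher number sees more of the post.
PUBLIC, FRIENDS, FAMILY = 0, 1, 2


@dataclass
class Segment:
    text: str
    min_level: int = PUBLIC  # lowest audience level allowed to read this segment


def render(segments, viewer_level):
    """Return (visible_text, more_available) for a viewer at viewer_level."""
    visible = [s.text for s in segments if s.min_level <= viewer_level]
    # True when something was held back, so the response can advertise
    # that a more complete version exists for a better-authenticated reader.
    more_available = any(s.min_level > viewer_level for s in segments)
    return " ".join(visible), more_available


POST = [
    Segment("Started a new project this week."),
    Segment("It means I'm job hunting.", FRIENDS),
    Segment("Money is tight until it lands.", FAMILY),
]

if __name__ == "__main__":
    for level in (PUBLIC, FRIENDS, FAMILY):
        print(level, render(POST, level))
```

The more_available flag stands in for the ‘a more complete version is available’ hint described above; an anonymous reader gets the public layer plus that hint, and nothing else.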

As I explore this more, I’m more conflicted.  On one hand, it sounds plausible, and possibly doable were this to be integrated in to  some key communication tools (facebook, wordpress, myspace, etc.).  However, it’s complex.  It’s complex to implement and complex to think about.  Complexity rarely wins out over the simple on the internet.  In other ways it may be a solution in search of a problem.  Well, *I’ve* found it a problem – content creation and distribution with different sections of content intended for different audiences.  Has anyone else found the problem of multiple identities and multiple audiences to be enough of a problem to contemplate these sorts of measures?  Or am I just barking up the wrong tree?  Or just simply barking, as my wife suggests?

Thinqing of linqing

July 9th, 2007

I’ve been digging more in to groovy and grails lately – done more than toe-dipping but nothing I want to shout about yet.  The more I tested the collections and looping – syntax like doc.entry.findAllBy() stuff – I remembered linq (or is it LINQ?).  This is data access technology that will be central to Microsoft’s upcoming C#3.0 release.  I won’t pretend to be able to discuss it intelligently here – I don’t do much in the MS world right now.  However, I did see a demo of it this past January.  And so I started reading up on it some more, and found a number of Java-er reactions.  One in particular stuck out, and I believe this was an MSDN followup.  Both are a year old, but highlight what will likely be a big shift in the next few years.  Or not.  It’s easy to sit on the fence, but I do think that ultimately MS will have another big impact on how developers think about data access with LINQ.

One of the interesting comments in one of those posts was about how the ‘dynamic’ language crowd (I think he was particularly talking about Ruby advocates) states that you have to have dynamic languages to provide some of the more flexible bits of functionality they provide.  Probably in large part that’s been true, although groovy/jruby/ironpython and other ‘dynamic’ languages running on ‘static’ language VMs point out that it’s not a 100% requirement.  If MS can pull off making co-existing hybrid static/dynamic typing first-class, which LINQ seems to do, they’ll have another leg up on competing players for the next several years.

These sorts of things need to be added at the system core.  MS can do it with .NET.  Sun (basically) has to do it with the JVM.  Given that Sun has a public perception of dragging their feet on the language, perhaps the recent open sourcing of Java will give some other larger players the ability to add these sorts of new ideas to Java.  But just ‘adding’ stuff in doesn’t necessarily mean it’ll be integrated, or that there will be interesting tech to take advantage of it (though having the foundation would be a start).

I’m simply an interested bystander – I don’t have any particular horse in this race, but it’s certainly an interesting time to be watching these technologies evolve.

(Bonus points for anyone who got the Beatles reference in this post).

PHP4->5 XML wrapper

May 28th, 2007

I’ve written here before about the pain that is migrating PHP4 apps which use DOMXML to PHP5.  I found a wrapper system, just today, at http://alexandre.alapetite.net/doc-alex/domxml-php4-php5/.  I know I’ve looked for this sort of thing before, and I’m not sure why I didn’t find it earlier, but it’s just made a tedious task take about 2 minutes.  Thanks guys!

While I’m sure it’s not perfect, it seems to be working well enough for what I needed it to do – basically read data from an XML file in to an array structure.  Doesn’t simplexml do that already?  Sort of, but I didn’t want to have to go revisit all the code – this wrapper recreates the missing functions giving transparent backwards compatibility.  This is the sort of thing PHP5 should have shipped with from day one, along with the ability to run both 4 and 5 together as Apache modules side by side.

Social network mashup v2.0

May 23rd, 2007

Friend Joe Stump recently (as in today, I think) launched correlate.us, a feed aggregator which brings together your personal feeds from delicious, twitter, flickr and digg.  I believe more feed streams will be forthcoming, but for now it’s hitting some of the larger players.  In a nutshell, it’s a way to see not only tagged items from a particular site (like you can with delicious) but also items with the same tags from other networks.  On delicious, I can see what items are tagged with ‘beatles’, but I can’t see items tagged ‘beatles’ from flickr at the same time, nor can I see what items a particular user has tagged across multiple networks.  It’s a very interesting (yet seemingly obvious) mashup, and I’m looking forward to seeing what comes of this project.  Good luck Joe!

rss/ical combination

January 2nd, 2007

I’ve not seen any signs we’re quite there yet, though searching for “ical/ics” and “rss enclosure” does bring up some interesting ideas. In short, what I am hoping to see is something like the following:

When I’m authoring a blog entry, I can add specific event information (date/time/location/etc) which gets added to the RSS feed for that blog. The data could be in its own format, but it likely makes more sense to use the ical format. Whether this is embedded directly in the RSS or a reference to an automatically created ics file doesn’t really matter to me – both could be fine.
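
As a sketch of the ‘reference to an automatically created ics file’ option, the Python below builds a minimal one-event iCalendar document and an RSS item whose enclosure points at it, much like a podcast item points at an audio file.  The function names, the example.com URLs and the PRODID value are all hypothetical, and the iCalendar output is deliberately bare (no text escaping, UTC times only).

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ICAL_TIME = "%Y%m%dT%H%M%SZ"  # iCalendar UTC date-time form


def build_ics(uid, summary, location, start, end, stamp):
    """Return a minimal one-event iCalendar document.

    All datetimes are treated as UTC. Commas, semicolons and newlines
    in the text fields are not escaped in this sketch.
    """
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//blog-events//EN",  # hypothetical product id
        "BEGIN:VEVENT",
        "UID:" + uid,
        "DTSTAMP:" + stamp.strftime(ICAL_TIME),
        "DTSTART:" + start.strftime(ICAL_TIME),
        "DTEND:" + end.strftime(ICAL_TIME),
        "SUMMARY:" + summary,
        "LOCATION:" + location,
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    return "\r\n".join(lines) + "\r\n"  # iCalendar lines end in CRLF


def build_item(title, link, ics_url, ics_text):
    """Return an RSS <item> whose enclosure points at the event's .ics file."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "enclosure", {
        "url": ics_url,
        "length": str(len(ics_text.encode("utf-8"))),  # size in bytes
        "type": "text/calendar",
    })
    return item


if __name__ == "__main__":
    ics = build_ics(
        uid="event-1@example.com",
        summary="Monthly meetup",
        location="Public library, room 2",
        start=datetime(2007, 1, 15, 19, 0),
        end=datetime(2007, 1, 15, 21, 0),
        stamp=datetime(2007, 1, 2, 12, 0),
    )
    item = build_item(
        "Come to the monthly meetup",
        "http://example.com/blog/meetup",
        "http://example.com/blog/meetup.ics",
        ics,
    )
    print(ET.tostring(item, encoding="unicode"))
```

A feed reader that doesn’t understand text/calendar enclosures would simply show the title and link as usual, which fits the hope above that existing readers could pick up the event data whenever they get around to it.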

Feeds carrying references to audio files became known as ‘podcasts’. While the main focus – the audio file – was intended for media players, the other data in the RSS stream provides valuable context and can be used on its own in feed readers. While many podcast feeds are solely the audio file, beyond a title and a short description, that’s just how it is now, not how it has to be. All the data in the RSS feed can be complementary, and in my view, event information is perfectly suited for some form of standardized inclusion process in RSS feeds.

I know that a lot of people ‘subscribe’ to ICS files directly, but I’m not sure that’s the best way to go about it. My biggest concern is that there’s not enough meta-data around the event info. In some cases there doesn’t need to be much, but in other cases, more data would help. By my reading of ICS, it’s not XML based, and there are defined standards for what fields are to be included. I imagine it degrades gracefully, ignoring unknown keywords, but something still feels very utilitarian about it (not necessarily a bad thing, but doesn’t seem to leave room for much innovation).

EXAMPLE: In a blog post, I can add a long description of the event, links to more info, such as maps, etc. Perhaps even include an audio or video file to describe the event in more detail. Current readers can be extended to incorporate this new info whenever, adding new functionality. Imagine the google rss reader offering to add event information from the feed items you’re reading directly in to your google calendar. While I’m not the biggest google fan, they’re a convenient example because they own both components right now – I can’t think of too many companies that do (does MS? I’d think so).

I’m probably not putting this very eloquently, but I hope I get the idea across. This seems like it’s something that needs to come from one of the big players, or perhaps a consortium, to get momentum, otherwise it’s just someone else’s half-baked code idea :)

Addition:

I think the bigger hurdle here is adoption on the reader side (is it that obvious?). If someone like feedburner would offer the option to ‘add to ical, add to yahoo cal, add to outlook’, etc in their feed processing for ical data associated with a blog post, that would probably ensure adoption right there.

Simplepie

December 1st, 2006

I’ve been meaning to write about simplepie.org for a while, but haven’t got around to it (mentioned in my podcast, though).  If you’re using PHP to do RSS parsing, use simplepie.  It’s that easy of a decision to make right now.  I did find a dissenting voice, claiming that simplepie was way slower than magpie – I didn’t find that to be the case.  I will likely be converting pfblogs.com to use simplepie in the coming weeks, as it greatly simplifies RSS parsing, deals with timezones, differing RSS/Atom tags, and a host of other niceties that magpie doesn’t do.

SOAP discussion – “S” used to mean “Simple”

November 27th, 2006

I’m not sure there’s much more to say about this, except that it sums up much of my own observations about how SOAP has evolved over the years.  The “S” really did stand for “Simple” at one point, but AFAICT, since the 1.2 spec, SOAP is not officially an acronym for anything any more.
