Evergreen

Bug #1251394
Comment #9

Dan Wells (dbw2) wrote on 2013-11-18:

Bill, I think this is certainly an area of need, and the direction you have chosen is definitely viable. I have concerns about continually increasing disk space requirements, but if this ultimately does wholly replace reporter.material_simple_record, that problem is ameliorated.

That said (and I am only raising this point for discussion), I wonder if we should be thinking bigger. In some parts of the code, we transform to MODS, which is often only sort of what we need, then try to further manipulate the MODS output (through normalizers or other stock functions) to get a final result. In other places (especially the OPAC), we have given up on MODS, and instead deal with the MARC directly.

This approach has some known weaknesses:
1) Duplication of logic in multiple places (reporter, MVR, OPAC)
2) Inconsistency when new logic is not propagated completely
3) Not every tool is equally suited to manipulate XML/MARC data

While it has come up a few times in past discussions, I would like to propose that we reconsider adopting a custom XSL approach, which would mean at least one and perhaps many XSL stylesheets dedicated to MARC data extraction, whether for indexing, display, or both.

In this case, I am imagining that the data we would want in various client interfaces is exactly what is already in the results list of the OPAC. It would therefore seem very beneficial for the data extraction and selection logic to be consolidated into a single place. And, while not a holy grail, XSL happens to be really good at transforming XML data, especially when speed is a factor.
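To make "a single place" a bit more concrete, here is a rough sketch of one shared extraction routine that the reporter, MVR, and OPAC could all call. It is written in Python with only the standard library, not in XSL, purely so it can be run as-is; an XSL stylesheet would play the same role. The function names, the sample record, and the field choices (245 $a/$b for title, 100 $a for author) are assumptions for illustration, not Evergreen's actual logic.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML ("MARC21 slim") namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def subfields(record, tag, codes):
    """Return stripped subfield values (for the given codes) from the
    first datafield carrying this tag, or an empty list if none exists."""
    field = record.find(f"marc:datafield[@tag='{tag}']", MARC_NS)
    if field is None:
        return []
    return [
        sf.text.strip()
        for sf in field.findall("marc:subfield", MARC_NS)
        if sf.get("code") in codes and sf.text
    ]


def extract_display_fields(marcxml):
    """Hypothetical single entry point for display-field extraction.

    Assumed mapping: title = 245 $a + $b, author = 100 $a.
    """
    record = ET.fromstring(marcxml)
    return {
        "title": " ".join(subfields(record, "245", ("a", "b"))),
        "author": " ".join(subfields(record, "100", ("a",))),
    }


# Made-up sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Melville, Herman,</subfield>
    <subfield code="d">1819-1891.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Moby Dick :</subfield>
    <subfield code="b">or, The whale /</subfield>
    <subfield code="c">Herman Melville.</subfield>
  </datafield>
</record>"""

if __name__ == "__main__":
    print(extract_display_fields(SAMPLE))
```

The point is only that the selection rules live in one function; any change to how 'title' is assembled would then show up everywhere at once instead of needing to be propagated by hand.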

In some ways, we are already doing this, as our MODS XSL slowly drifts away from the original standard version. What I am asking is that we be more explicit about it, and perhaps also separate some logic into more special-purpose sheets. I know there will be some concerns about multiple transforms, but the cost of the transforms is very low compared to many things we already do, and is lower still when the XSL is trim and focused. In a basic test on my aging DB server, a simplified in-DB transform which did only a MODS-style title and author extraction went at a clip of around 1,000 records per second.

Again, this is all just for discussion, as I haven't done enough testing (or thinking) to wholesale endorse any of this. Also, I know there are plenty of in-between approaches, and I recognize that this proposal isn't even necessarily in conflict with what we are doing here, but in fact could be done in a more or less complementary way. Still, it appears we find ourselves defining best practices for display of fundamental things like 'title' and 'author', so I think we should consider moving that discussion up (down?) a level into the MARCXML transformation logic itself.

Thoughts?
