Segmentation fault on accessing CrossRef metadata

Bug #222340 reported by GadAbraham
Affects     Status         Importance  Assigned to  Milestone
Referencer  Fix Committed  Undecided   Unassigned

Bug Description

Hi,

First, you're doing great work on this tool, it's really useful!

I'm using version 1.1.2, compiled from source on Ubuntu Gutsy amd64.

I have a CrossRef account, and when I add a PDF with DOI doi:10.1002/sim.1844, Referencer segfaults.
I noticed that the DOI in that PDF is in parentheses, and the closing parenthesis is still in the CrossRef
request, see below.

Thanks,
Gad

$ gdb /usr/local/bin/referencer
GNU gdb 6.6-debian
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) r
Starting program: /usr/local/bin/referencer
[Thread debugging using libthread_db enabled]
[New Thread 47763704600336 (LWP 5016)]
warning: Lowest section in /usr/lib/libicudata.so.36 is .hash at 0000000000000120
main: setting PYTHONPATH to ':./plugins:/home/gad/.referencer/plugins:/usr/local/lib/referencer:'
PluginManager::scan: Found module 'lyx'
Plugin::load: No metadata capabilities in lyx
Trying /usr/local/lib/referencer/lyx.png
Plugin::load: successfully loaded 'lyx'
PluginManager::scan: Found module 'pubmed'
Plugin::load: No actions in pubmed
Plugin::load: successfully loaded 'pubmed'
[New Thread 1082132816 (LWP 5019)]
[New Thread 1090525520 (LWP 5020)]
[New Thread 1098918224 (LWP 5021)]
[New Thread 1107310928 (LWP 5022)]
[New Thread 1115703632 (LWP 5023)]
[New Thread 1124096336 (LWP 5024)]
[New Thread 1132489040 (LWP 5025)]
[New Thread 1140881744 (LWP 5026)]
[Thread 1082132816 (LWP 5019) exited]
[New Thread 1149274448 (LWP 5027)]
[New Thread 1082132816 (LWP 5028)]
[New Thread 1157667152 (LWP 5029)]
[Thread 1157667152 (LWP 5029) exited]
RefWindow::run: entering main loop
[Thread 1124096336 (LWP 5024) exited]
[Thread 1115703632 (LWP 5023) exited]
[Thread 1107310928 (LWP 5022) exited]
[Thread 1098918224 (LWP 5021) exited]
[Thread 1090525520 (LWP 5020) exited]
[Thread 1149274448 (LWP 5027) exited]
[Thread 1140881744 (LWP 5026) exited]
[Thread 1132489040 (LWP 5025) exited]
[Thread 1082132816 (LWP 5028) exited]
[New Thread 1132489040 (LWP 5030)]
[New Thread 1149274448 (LWP 5031)]
[New Thread 1140881744 (LWP 5032)]
[Thread 1140881744 (LWP 5032) exited]
[Thread 1132489040 (LWP 5030) exited]
[Thread 1149274448 (LWP 5031) exited]
[New Thread 1149274448 (LWP 5033)]
[New Thread 1132489040 (LWP 5034)]
[Thread 1149274448 (LWP 5033) exited]
[Thread 1132489040 (LWP 5034) exited]
[New Thread 1132489040 (LWP 5035)]
[New Thread 1149274448 (LWP 5036)]
[New Thread 1140881744 (LWP 5037)]
[Thread 1140881744 (LWP 5037) exited]
[Thread 1132489040 (LWP 5035) exited]
[Thread 1149274448 (LWP 5036) exited]
[New Thread 1149274448 (LWP 5038)]
[New Thread 1132489040 (LWP 5039)]
[Thread 1149274448 (LWP 5038) exited]
[Thread 1132489040 (LWP 5039) exited]
[New Thread 1149274448 (LWP 5055)]
Document::getMetaData: trying module 'pubmed'
[New Thread 1132489040 (LWP 5056)]
[New Thread 1140881744 (LWP 5057)]
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
openCB: result OK, opened
Done!
Waiting...
Woo, read 865 bytes
readCB: result OK
Waiting...
[New Thread 1082399056 (LWP 5058)]
[Thread 1140881744 (LWP 5057) exited]
Waiting...
Woo, read 0 bytes
readCB: EOF
Done!
[New Thread 1140881744 (LWP 5059)]
Waiting...
closeCB: result OK, closed
Done!
[Thread 1132489040 (LWP 5056) exited]
referencer_download: got 865 characters
[Thread 1140881744 (LWP 5059) exited]
/usr/local/lib/referencer/pubmed.py:43: DeprecationWarning: raising a string exception is deprecated
  raise "pubmed.get_citation_from_doi: DOI not found"
pubmed.resolve_metadata: Got no metadata
>> referencer_document_dealloc
Document::getMetaData: module 'lyx' has no suitable capabilities
Document::getMetaData: module 'arxiv' has no suitable capabilities
Document::getMetaData: trying module 'crossref'
CrossRefPlugin::resolve: using url 'http://www.crossref.org/openurl/?pid=ourl_REDACTED:REDACTED&id=doi:10.1002/sim.1844)&noredirect=true'
[New Thread 1140881744 (LWP 5060)]
[New Thread 1132489040 (LWP 5061)]
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
openCB: result OK, opened
Done!
Waiting...
Woo, read 605 bytes
readCB: result OK
Waiting...
Woo, read 0 bytes
readCB: EOF
Done!
Waiting...
closeCB: result OK, closed
Done!
[Thread 1140881744 (LWP 5060) exited]
[Thread 1132489040 (LWP 5061) exited]

<?xml version = "1.0" encoding = "UTF-8"?>
<crossref_result version="2.0" xmlns="http://www.crossref.org/qrschema/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.crossref.org/qrschema/2.0 http://www.crossref.org/qrschema/crossref_query_output2.0.xsd">

<query_result>
        <head>
                <email_address><email address hidden></email_address>
                <doi_batch_id>w001</doi_batch_id>
        </head>
        <body>

                <query status="unresolved" fl_count="0" >
                        <doi>10.1002/sim.1844)</doi>
                        <msg>DOI does not exist in CrossRef</msg>
                </query>

        </body>
</query_result>

</crossref_result>

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47763704600336 (LWP 5016)]
0x00002b70d4b73aaf in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string () from /usr/lib/libstdc++.so.6
(gdb) bt
#0 0x00002b70d4b73aaf in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string () from /usr/lib/libstdc++.so.6
#1 0x00000000004376fe in CrossRefParser::on_start_element (this=0x7fffde5f3410, context=<value optimized out>, element_name=<value optimized out>, attributes=<value optimized out>) at CrossRefPlugin.C:58
#2 0x00002b70d0708830 in Glib::Markup::ParserCallbacks::start_element () from /usr/lib/libglibmm-2.4.so.1
#3 0x00002b70d4213e9b in g_markup_parse_context_parse () from /usr/lib/libglib-2.0.so.0
#4 0x00002b70d0707ee9 in Glib::Markup::ParseContext::parse () from /usr/lib/libglibmm-2.4.so.1
#5 0x0000000000436519 in CrossRefPlugin::resolve (this=<value optimized out>, doc=@0xeb0c70) at CrossRefPlugin.C:233
#6 0x000000000043b820 in Document::getMetaData (this=0xeb0c70) at Document.C:509
#7 0x00000000004873a4 in RefWindow::addDocFiles (this=0x7fffde5f5700, filenames=@0x7fffde5f40a0) at RefWindow.C:1684
#8 0x0000000000488956 in RefWindow::onAddDocFile (this=0x7fffde5f5700) at RefWindow.C:1887
#9 0x00002b70d0715703 in Glib::SignalProxyNormal::slot0_void_callback () from /usr/lib/libglibmm-2.4.so.1
#10 0x00002b70d3b9d99a in g_closure_invoke () from /usr/lib/libgobject-2.0.so.0
#11 0x00002b70d3bad96a in ?? () from /usr/lib/libgobject-2.0.so.0
#12 0x00002b70d3baeaf3 in g_signal_emit_valist () from /usr/lib/libgobject-2.0.so.0
#13 0x00002b70d3baecc3 in g_signal_emit () from /usr/lib/libgobject-2.0.so.0
#14 0x00002b70cf7c85b3 in _gtk_action_emit_activate () from /usr/lib/libgtk-x11-2.0.so.0
#15 0x00002b70d3b9d99a in g_closure_invoke () from /usr/lib/libgobject-2.0.so.0
#16 0x00002b70d3bad6b8 in ?? () from /usr/lib/libgobject-2.0.so.0
#17 0x00002b70d3baeaf3 in g_signal_emit_valist () from /usr/lib/libgobject-2.0.so.0
#18 0x00002b70d3baecc3 in g_signal_emit () from /usr/lib/libgobject-2.0.so.0
#19 0x00002b70cf9ac2fa in gtk_widget_activate () from /usr/lib/libgtk-x11-2.0.so.0
#20 0x00002b70cf8ae3e0 in gtk_menu_shell_activate_item () from /usr/lib/libgtk-x11-2.0.so.0
#21 0x00002b70cf8afda6 in ?? () from /usr/lib/libgtk-x11-2.0.so.0
#22 0x00002b70cf8a215d in _gtk_marshal_BOOLEAN__BOXED () from /usr/lib/libgtk-x11-2.0.so.0
#23 0x00002b70d3b9d99a in g_closure_invoke () from /usr/lib/libgobject-2.0.so.0
#24 0x00002b70d3badcc8 in ?? () from /usr/lib/libgobject-2.0.so.0
#25 0x00002b70d3bae8c7 in g_signal_emit_valist () from /usr/lib/libgobject-2.0.so.0
#26 0x00002b70d3baecc3 in g_signal_emit () from /usr/lib/libgobject-2.0.so.0
#27 0x00002b70cf9a80ae in ?? () from /usr/lib/libgtk-x11-2.0.so.0
#28 0x00002b70cf89b4fb in gtk_propagate_event () from /usr/lib/libgtk-x11-2.0.so.0
#29 0x00002b70cf89c504 in gtk_main_do_event () from /usr/lib/libgtk-x11-2.0.so.0
#30 0x00002b70d18cb1dc in ?? () from /usr/lib/libgdk-x11-2.0.so.0
#31 0x00002b70d420dfd3 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#32 0x00002b70d42112dd in ?? () from /usr/lib/libglib-2.0.so.0
#33 0x00002b70d42115ea in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#34 0x00002b70cf89c883 in gtk_main () from /usr/lib/libgtk-x11-2.0.so.0
#35 0x00002b70ce8bfeec in Gtk::Main::run () from /usr/lib/libgtkmm-2.4.so.1
#36 0x0000000000463af5 in main (argc=1, argv=0x7fffde5f5b18) at main.C:90

GadAbraham (gad-abraham) wrote :

Here's another PDF it happens on, if I save the PDF and import it. "Add Reference with ID" works for the DOI (when used without parentheses, at least).

http://genomebiology.com/content/pdf/gb-2008-9-1-r22.pdf

John S (jcspray) wrote :

Probably a 64-bit-specific bug. Bear with me while I install a 64-bit Linux to sort this out.

John S (jcspray) wrote :

Fixed in hg

Changed in referencer:
status: New → Fix Committed
GadAbraham (gad-abraham) wrote :

Running the latest version from Mercurial, it adds the file (the one from genomebiology.com) and doesn't crash (great!), but the CrossRef data isn't retrieved, perhaps because there's still a parenthesis in the DOI:

<?xml version = "1.0" encoding = "UTF-8"?>
<crossref_result version="2.0" xmlns="http://www.crossref.org/qrschema/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.crossref.org/qrschema/2.0 http://www.crossref.org/qrschema/crossref_query_output2.0.xsd">

<query_result>
        <head>
                <email_address><email address hidden></email_address>
                <doi_batch_id>w001</doi_batch_id>
        </head>
        <body>

                <query status="unresolved" fl_count="0" >
                        <doi>10.1186/gb-2008-9-1-r22)</doi>
                        <msg>DOI does not exist in CrossRef</msg>
                </query>

        </body>
</query_result>

</crossref_result>

GadAbraham (gad-abraham) wrote :

Now I'm sure it's the parenthesis: if I manually edit the properties and remove it, the CrossRef data is retrieved fine.

John S (jcspray) wrote :

The regex that extracts DOIs from document text recognises parentheses as valid characters of a DOI. In fact, the relevant standard allows for any Unicode character excluding control characters. For instance, this is a valid DOI: 10.2983/0730-8000(2007)26[281:BAAESA]2.0.CO;2
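
To illustrate the point, here is a minimal Python sketch of a permissive DOI pattern. The pattern and the `find_doi` helper are assumptions made up for this illustration, not Referencer's actual expression. It keeps the parentheses that belong to the DOI above, and for the same reason it also keeps a parenthesis that merely encloses a DOI in the text.

```python
import re

# Hypothetical permissive pattern (not Referencer's actual regex):
# "10.", a numeric prefix, "/", then any run of non-whitespace characters.
DOI_PATTERN = re.compile(r"10\.\d+/\S+")


def find_doi(text):
    """Return the first DOI-like substring in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    return match.group(0) if match else None


# Parentheses that are part of the DOI are kept, as they should be.
print(find_doi("doi:10.2983/0730-8000(2007)26[281:BAAESA]2.0.CO;2"))

# A parenthesis that only wraps the DOI in the text is captured too.
print(find_doi("(doi:10.1002/sim.1844)"))
```

The second call returns the DOI with the closing parenthesis still attached, which is the string that ended up in the CrossRef request.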

GadAbraham (gad-abraham) wrote :

That may well be the case, but as you can see, the one with the single parenthesis at the end doesn't get resolved, whereas the one without it does.

John S (jcspray) wrote :

What I was referring to is this: the DOI comes from the PDF's plain text, and it is not always possible to get out the correct DOI. The expression used is roughly "doi xx.xx/xx", where the x can be any character other than whitespace. This works on a lot of PDFs, but if there are spurious characters such as a closing parenthesis at the end of the DOI, then there's no simple way to tell whether they're part of the code or not.

Anyway, I took another look at the code and added a special case for (doi:xx/xx) to remove the trailing parenthesis. Time will tell whether that breaks it for anything else.
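
A special case of this kind could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the pattern and the `extract_doi` helper are hypothetical, and the actual change in Referencer may be written quite differently. The idea is to strip one trailing ")" only when the match was opened by "(" directly before "doi".

```python
import re

# Hypothetical pattern (not the actual Referencer code): an optional "("
# directly before "doi", an optional colon, optional whitespace, then the DOI
# itself as a run of non-whitespace characters.
DOI_PATTERN = re.compile(r"(\()?doi:?\s*(10\.\d+/\S+)", re.IGNORECASE)


def extract_doi(text):
    """Return the first DOI found in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    opened_with_paren, doi = match.group(1), match.group(2)
    # Special case "(doi:xx/xx)": treat the final ")" as closing the wrapper
    # rather than as part of the DOI.
    if opened_with_paren and doi.endswith(")"):
        doi = doi[:-1]
    return doi


print(extract_doi("(doi:10.1002/sim.1844)"))
print(extract_doi("(DOI 10.1186/gb-2008-9-1-r22)"))
print(extract_doi("doi:10.2983/0730-8000(2007)26[281:BAAESA]2.0.CO;2"))
```

This heuristic can still go wrong. For text such as "(doi:10.1000/abc(1) and more)", where 10.1000/abc(1) is a made-up DOI, the ")" that belongs to the DOI would be stripped, because the wrapper's closing parenthesis comes later in the sentence.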
