Run parsers and generate hpz files
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Hamlet | In Progress | High | WVU Modeling Intelligence Lab |
Bug Description
We now have a fully functioning pipeline but are lacking the data to use with it. Each parser will create a .hpz file that is used as input to the pre-processor. We need to generate these files.
This has already been done for the law dataset (half of it at least). It also needs to be done for the following datasets:
* STEP
* Text/Html
* Java
A large task for this will be finding the data to run the parsers on.
* STEP - There is a folder on wisp at hamlet/data/STEP. It contains a collection of STEP files we extracted from the nara data and would be a good starting point. Talk to Greg if you need assistance with this.
* Text/HTML - I'm not sure about this one. Should we use the data from the nara folks? Talk to Adam and see what he has to say, since he wrote this parser.
* Java - This is your cup of tea. The good thing about this parser is that there is a wide variety of datasets it can be run on. I recommend you start with Weka. Try to build an hpz file for at least 3 different large open source projects.
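A first step for any of these datasets is simply gathering the candidate input files. The snippet below is a rough sketch of that step only, not part of Hamlet: `collect_inputs` and `DATASET_EXTENSIONS` are hypothetical names, and the extension lists are assumptions that should be adjusted to whatever each parser actually accepts.

```python
from pathlib import Path

# Hypothetical mapping from dataset name to file extensions.
# These lists are assumptions; check what each parser really reads.
DATASET_EXTENSIONS = {
    "STEP": {".step", ".stp"},
    "Text/HTML": {".txt", ".html", ".htm"},
    "Java": {".java"},
}


def collect_inputs(root, extensions):
    """Return a sorted list of files under root whose suffix,
    compared case-insensitively, is in extensions."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
```

For example, `collect_inputs("path/to/weka", DATASET_EXTENSIONS["Java"])` would list every `.java` file in a checkout stored at that (placeholder) path, which gives a quick sense of how large the dataset is before running the parser on it.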
I'm going to ask that Adam and Greg comment on this bug with instructions/
Changed in hamlet:
assignee: nobody → mhull1
About the only thing I can think to add about the STEP parser is that you might need to change the hard-coded directory in it to match where you store the STEP files. Just give me a shout if you need any help.
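One way to avoid editing the hard-coded directory on every machine is to let a setting override it. The sketch below only illustrates that idea in Python and is not the STEP parser's actual code: the function `resolve_step_dir` and the environment variable `HAMLET_STEP_DIR` are hypothetical, and the default is just the path mentioned in the bug description.

```python
import os
from pathlib import Path

# Default taken from the bug description; the variable name below is hypothetical.
DEFAULT_STEP_DIR = "hamlet/data/STEP"


def resolve_step_dir(env=None, default=DEFAULT_STEP_DIR):
    """Return the STEP input directory: HAMLET_STEP_DIR if it is set
    and non-empty, otherwise the default."""
    env = os.environ if env is None else env
    return Path(env.get("HAMLET_STEP_DIR") or default)
```

With something like this in place, each person could point the parser at their own copy of the STEP files without touching the source.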