Log of the #blacklight channel on chat.freenode.net

Using timezone: GMT-05:00
* ndushay leaves05:12
<erikhatcher>MARC indexing speed, butt kicked: http://groups.google.com/group/solrmarc-tech/browse_thread/thread/b9ba2ed86f5da979 08:50
<Bill_Home>erikhatcher++ # WOW!08:57
<erikhatcher>prove it though08:58
<BillDueber>erikhatcher: I'm gonna do my best to. Won't have time to play with the code until tomorrow, but then I'm gonna dig in.09:01
<erikhatcher>all 10 lines of code? ;)09:02
<BillDueber>Need to see if I can extract or reproduce the marc-mapping code from solrmarc
No, not your code. solrmarc.
<erikhatcher>yeah, gotta wire it together, that's the missing link
should be "easy" enough
<BillDueber>erikhatcher: And we'll be able to see how much indexing time is the pre-processing of the marc record. 09:03
<erikhatcher>yup
<BillDueber>I admit, I'm gonna feel kinda bad for Bob if this turns out to be the golden egg it looks like. He's worked his ass off on solrmarc.09:04
We need some sucker to adopt the marc4j package, too.
<erikhatcher>i feel bad because i did an update handler just like this 1.5 years ago in a room with Bess, Bob, Andrew Nagy, etc at UVa09:05
and espoused it as the way to go
complexity is to be eschewed and fought ruthlessly
and the speed of my update handler is the same... shouldn't be any diff really09:06
and you could point solr (using the content stream stuff) to a MARC file or POST one (or more) in
we can achieve that with this DIH stuff too, need to look into content streaming
<BillDueber>I don't understand enough about what's going on to know why it's making a difference. When I index, the first million go like crazy. It *feels* like it's the merges that slow things down. But maybe it's something else? I don't even know how to start looking.09:07
And maybe I won't have to. If I use your stuff and it all works, well, hell, then I just won't care.09:08
<erikhatcher>i'm not gonna look09:13
i'm satisfied with the DIH approach.
but of course, now comes the real work, making it production ready.
* jamieorc joins09:25
* bess joins09:51
<MrDys>erikhatcher++ #solr superstar10:13
<jrochkind_home>Sorry, I just emailed too. 10:58
But the thing is, we actually NEED the ability to locally define sophisticated mappings. Our apps all depend upon it.
<MrDys>jrochkind: I'm pretty sure the crux of the message was that it was possible, he just didn't do it for testing purposes11:04
<jrochkind>it's not clear to me that once you do it... you're not just going to run into the same perf problems. 11:05
at least if you use code similar to SolrMarc to do it. If you optimize the bottlenecks... then is the issue really optimizing bottlenecks in the complex mapping code, rather than shifting the method of interacting with solr? This is my question/concern.
<BillDueber>jrochkind: I think it *is* clear to me. Otherwise we wouldn't see such radical differences between indexing 100K records and 6M records in terms of throughput/second.11:06
<jrochkind>huh. okay then, fair enough.
But I can't use the new method until someone has the time to put in support for mapping that is just as complex as what SolrMarc has.
<MrDys>...neither can most people using blacklight for marc11:08
<BillDueber>I guess that seems like the easier problem to me.
<jrochkind>BillDueber: Sweet. Maybe you want to write the code, heh. But sorry, don't mean to be annoying/obstructionist, very pleased that people are looking into this. 11:19
It would be convenient if a request handler approach allowed configuration of mapping in exactly the same formats as SolrMarc, otherwise I'm going to need to 'translate' my mappings, which are quite involved.
<BillDueber>jrochkind: I would, actually, in my Copious Free Time. I'd like to have a more complex format to specify the marc fields, too (e.g., include indicators)11:20
<jrochkind>heheh.
<BillDueber>But yeah, parsing the solrmarc format would be a necessary first step
<jrochkind>Yes, in my non-existent copious freetime I'd like to do the same.
But I was actually about to try adding support for indicators to SolrMarc. That is on my actual (not fantasy) list of things to be done, shouldn't be that hard.
Of course, if everyone's gonna abandon SolrMarc, then it's less useful to add that feature to SolrMarc.
BillDueber: But as usual, I have huge trust in your understanding of the problem domain in particular, if you're involved in any coding of this stuff, it'll definitely make me happy. 11:22
<BillDueber>jrochkind: You trying to butter me up so I'll work on something to your benefit????11:29
jrochkind: Or maybe we could provide a solrmarc->new_format converter. That'd be the same thing, and would allow simpler internal code.11:30
<jrochkind>Nah, I just like your code.
<BillDueber>jrochkind: That's 'cause you don't have to work on it. I hate my code :-)
<jrochkind>BillDueber: I now have a bunch of stuff coded up in Bean Shell for solrmarc. That's gonna have to be rewritten either way, I guess. bah.
BillDueber: But I like the idea of supporting ruby scripts via jruby instead of or in addition to Bean Shell.
<BillDueber>jrochkind: Well, not if you're happy with solrmarc. No absolute need to switch.11:31
<jrochkind>Well, I'm not happy with the indexing time, naturally! I am happy with the power/flexibility, especially in the new version.
But I am still not entirely convinced that once you add the power/flexibility into the new approach... it won't end up being super slow too.
<BillDueber>jrochkind: Yeah, I haven't messed with the new version. We're doing all our custom indexing via custom java methods.
<jrochkind>BillDueber: Bean Shell is actually kinda awesome. Although has unanswered performance questions of its own, heh. 11:32
<BillDueber>jrochkind: I just don't see why it would get slower if the problem is marc processing. It would be as slow (per record) on the first record as on the 5 millionth.
<jrochkind>you do have a point.
But it could be related to something solrmarc does that could be easily fixed without abandoning it. With just benchmarking and not profiling, we don't know _where_ the bottleneck is. 11:33
For instance, solrmarc using actual HTTP connection, and connecting to the solr streaming update server... how much of a difference will it make?
There are a buncha factors involved, can't just assume the tests mean "request handler approach is better than separate indexer over HTTP approach." Could be something else wrong with solrmarc's particular implementation of its approach, right?11:34
* cbeer__ joins11:35
* cbeer_ leaves
<jrochkind>But Ross tells me on listserv that the idea might be to use actually existing present SolrMarc code for mapping with this new approach too. I didn't realize/think of that. That DOES sound potentially more useful, and potentially would allow exactly the same config to be used to boot (even Bean Shell scripts!). 11:39
<rsinger>honestly, the current solrmarc performance problem seems like it should be a very high priority11:41
<BillDueber>erikhatcher: In EntityProcessorBase, nextRow() returns a Map<String,Object>. What's the value supposed to be for a multivalued field? A Set?
<rsinger>because the status quo makes it practically impossible to scrap and rebuild your index
<BillDueber>A couple hours vs overnight is a difference that makes a difference11:42
<jrochkind>very true. but sadly, I have many many priorities. Since I don't even have an acceptable local Blacklight up yet, and have a ways to go to get there. 11:43
Our current OPAC index rebuild is about the same as SolrMarc.
<rsinger>well, there's definitely no "playing around with different indexing setups" with the status quo
<jrochkind>Sure there is, you just do it with a demo sample of your complete corpus, which is what I'm doing now. 11:44
<rsinger>so it's ok if your solr machine goes down or your index gets corrupted being down for a *day*?11:47
just to rebuild the index?
<jrochkind>"ok" is relative. That's the situation with my present OPAC. Except there are workarounds to keep you from being down for a day, like keeping backups of your index. 11:51
which is what I have to do presently anyway. Is it good? No. But it's not a step backwards for me to still be in that situation. 11:52
so for me personally right now, my higher priorities have to be adding things to my Blacklight that are _already_ in my present OPAC, that I don't want to lose, is all I'm saying.
<erikhatcher>i'm trying to catch up here... i'm perplexed by the confusion here, and i encountered the same thing when i did the update request handler for marc 1.5 years ago too....13:00
just frickin' plug SolrMarc's API in to the placeholder i commented!
simple as that
except it probably isn't _that_ simple with SolrMarc - might need some tweaking, but nothing extensive i wouldn't think13:01
use the same ol' .properties, etc
will probably want to add some params that point to the .properties in the config file, but no biggie there
<jrochkind>erikhatcher: Yeah, I got that after rsinger said it. Right on. 13:07
erikhatcher: sorry for being initially confused.
I still am confused about WHY this would result in such a great performance gain, so wouldn't bet the farm that it will. But if it's easy enough to try and see... then we'll find out soon, and maybe I'll find out that I should have bet the farm. :)13:08
<erikhatcher>you skeptics!
how much easier to try could it be?
download solr 1.4
get my .zip
follow one or two steps
<jrochkind>integrating SolrMarc into your request handler is what needs to be tried.
<erikhatcher>and boom13:09
any speed degradation from here on out will be solrmarc's fault
but this is as fast as it can go, i think
<jrochkind>I believe your numbers. The question is, if once you integrate SolrMarc indexer into your request handler... do you lose all the perf gains cause they were really bottlenecks in solrmarc?
<erikhatcher>without parallelizing
i doubt you lose much at all
i think SolrMarc was just really lame on the communication to Solr stuff13:10
it's too messy to even dig into though... all that proxy reflection junk
<jrochkind>Oh, I _definitely_ believe all the problems are SolrMarc's fault. The question is if the 'fault' in SolrMarc is in its _mapping_ (which we need anyway, and need to optimize if it's got problems), or its "adding to lucene index", which we can indeed completely scrap and replace with your version.
Get it?
<erikhatcher>try it and see
<jrochkind>The weird thing is that since you saw the bad performance in SolrMarc even with http post, the problem is NOT in the weird proxy reflection junk.
<erikhatcher>jrochkind: the proxy reflection stuff is there for the http too13:11
<jrochkind>Oh? BAH.
<erikhatcher>it's bad bad bad.... SolrProxy and its ilk
<jrochkind>Why bob why?
I don't see why it should be needed at all for http.
<erikhatcher>but i didn't actually dig into trying to tune what was there, wanted to see how fast i could ingest that marc file first
jrochkind: it shouldn't be in there for embedded or http
SolrJ already has a SolrServer abstraction. use that!13:12
<jrochkind>But at any rate, here's my point. I'm not sure the difference is "external software using http post" vs "solr request handler". The difference is probably "solrmarc does some really weird crap that it shouldn't be doing."
ONE way to remove the really weird crap is to take the mapping stuff alone and graft it on to your request handler. But you can also just write a normal post-to-http app that didn't have the weird crap. no?
<erikhatcher>just use solrmarc for mapping, indeed13:13
and yes someone could do the other as well, and in fact with a SolrProxy implementation that didn't have reflection, might be worth a try. but really, it's probably easier to just graft solrmarc mapping into the entity processor than to keep talking about it ;)13:14
<jrochkind>The confusion is I thought your argument was that "built in request handler" (which of course you have to post marc to anyway) was probably faster than 'external app that posts to solr over http.' Which now that I say it like that, is so ridiculous that I don't know why I thought it was your argument. :)
For me, talking about code is always easier than working with code I'm not yet familiar with. :)
I still think I'd prefer a client that does the mapping and then posts to solr over a built in request handler, obviously. But yes, such a client should just do the mapping and do an ordinary post, not do the weird crap.
[I don't know what that 'obviously' was doing there, eliminate it. :) ]13:15
<erikhatcher>DataImportHandler *is* a request handler13:16
with full/delta/stop/debug capabilities baked in
* ndushay joins
<erikhatcher> i hit /solr/dataimport?command=full-import to index that marc file
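A minimal sketch of building the kind of DataImportHandler request mentioned above. It assumes the handler is registered at `/dataimport` as in the URL erikhatcher gives; the host, port, and the `DihUrlSketch` class name are illustrative assumptions, not something from this conversation. It only constructs the URL and does not contact a server.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class DihUrlSketch {

    /** URL-encode one query-string component as UTF-8. */
    public static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    /**
     * Build a DIH request URI such as
     * {solrBase}/dataimport?command=full-import&key=value...
     * solrBase is assumed to look like "http://localhost:8983/solr".
     */
    public static URI dihUri(String solrBase, String command,
                             Map<String, String> extraParams) {
        StringBuilder sb = new StringBuilder(solrBase);
        sb.append("/dataimport?command=").append(enc(command));
        for (Map.Entry<String, String> e : extraParams.entrySet()) {
            sb.append('&').append(enc(e.getKey()))
              .append('=').append(enc(e.getValue()));
        }
        return URI.create(sb.toString());
    }

    public static void main(String[] args) {
        // Hypothetical local Solr instance; adjust to your own setup.
        Map<String, String> none = new LinkedHashMap<>();
        System.out.println(dihUri("http://localhost:8983/solr", "full-import", none));
        System.out.println(dihUri("http://localhost:8983/solr", "status", none));
    }
}
```

The same helper would cover the other DIH commands (for example `delta-import`, `abort`, or `reload-config`) by changing the `command` argument.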
<jrochkind>well, yes. the custom request handler for MARC, I'm not seeing the attraction of.
<erikhatcher>damn y'all.... i try to help out and then i have to defend the best way to do things :)13:17
keep yer slow indexer! :P
<jrochkind>I am inclined to believe you on the best way to do things, don't get me wrong.
And, no, you've helped out a LOT.
<erikhatcher>i'm punchy.... stayed up all night on this, still haven't slept13:18
<jrochkind>But I _think_ we agree that what you showed was not necessarily that a custom request handler that takes marc must be faster than an indexer that posts Solr XML, right? Instead, you've just shown that there's SOMETHING completely wacky in solrmarc. Which is probably the weird proxy stuff.
<erikhatcher>the attraction of DIH is it's a "standard" conduit into Solr with flexibility to tie to other sources easily, etc
monitor and control
<jrochkind>yeah, you do make a good argument there. 13:19
<erikhatcher>something is way wacky in solrmarc, no question
i don't feel the need to dig deeper into it really since i've shown a better way to do this marc indexing
<jrochkind>So let's say we incorporate SolrMarc's mapping into a marc DIH. Now, depending on what source I'm indexing from, I may want to use _different_ mapping logic. So I guess I'd just set up multiple DIH handlers, with different config?
<erikhatcher>what are the downsides?
yeah13:20
one way to do it, or parameterize it with request params
<jrochkind>The config itself is way too big to send in a request param. But I guess you could send a name of a config set to use in a request param.
<erikhatcher>and don't forget that DIH comes with hooks for mapping data already too... Transformers they call it
don't send in the config... send in the config name
<BillDueber>They're more than meets the eye.
<jrochkind>righto, makes sense.
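The "send in the config name, not the config" idea could be sketched roughly as follows. The request parameter name `mapping`, the `.properties` naming scheme, and the `MappingConfigSketch` class are all hypothetical; nothing here is an existing DIH or SolrMarc API. The name is restricted to a safe character set so a request cannot point the handler at an arbitrary file.

```java
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;

public class MappingConfigSketch {

    /**
     * Resolve a mapping config name taken from request params to a
     * properties file inside configDir. Returns empty if the name
     * contains anything other than letters, digits, '_' or '-'.
     */
    public static Optional<Path> resolve(Map<String, String> requestParams,
                                         Path configDir,
                                         String defaultName) {
        // "mapping" is a hypothetical request parameter name.
        String name = requestParams.getOrDefault("mapping", defaultName);
        if (!name.matches("[A-Za-z0-9_-]+")) {
            return Optional.empty(); // rejects "../x", empty names, etc.
        }
        return Optional.of(configDir.resolve(name + ".properties"));
    }

    public static void main(String[] args) {
        Path conf = Path.of("conf");
        System.out.println(resolve(Map.of("mapping", "ebooks"), conf, "default"));
        System.out.println(resolve(Map.of(), conf, "default"));
        System.out.println(resolve(Map.of("mapping", "../secret"), conf, "default"));
    }
}
```

One handler with a per-request name like this is an alternative to registering several DIH handlers, each with its own fixed config; which is preferable probably depends on how many mapping sets there are.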
one possibly trivial downside is having to restart solr to modify the index mappings is kind of annoying in development. 13:21
<erikhatcher>jrochkind: wouldn't have to restart solr to update mappings13:22
<jrochkind>okay, cool. 13:23
mostly I think you've convinced me. :)
<erikhatcher>DIH even has a reload command to reload the DIH config file (which would not matter for the properties files; they'd be read dynamically every time)
<jrochkind>but it still seems good to be clear that making a DIH handler is not what's _required_ to fix the speed up. Getting rid of SolrMarc's weirdness is what's required to fix the speed up. Making it into a DIH handler may give added benefits, and may be one convenient way of getting rid of SolrMarc's weirdness. 13:24
I think.
I think making a DIH handler that uses SolrMarc mapping, with full config files, and multiple configs that can be selected in params... is a bit of work. But probably not more work than fixing SolrMarc's weirdness some other way. 13:25
<erikhatcher>solrmarc should focus on the mapping magic
not the solr indexing part13:26
<jrochkind>yeah, I hear ya. you have convinced me, I think.
<erikhatcher>i'm sure the indexing part could be fixed, but after the implementation i created ran so fast and gives a simple spot to make an API call to convert a Record to HashMap.... i mean, c'mon
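A rough sketch of what a "Record to HashMap" conversion might look like, including the multivalued-field question raised earlier about `nextRow()`. The `Subfield` type and the simplified `tag + subfield codes` spec (for example `650a`) are hypothetical stand-ins, not marc4j or SolrMarc classes or the full SolrMarc mapping syntax. Using a `List` for repeated values is an assumption based on DIH treating `Collection` values as multiple field values; it is worth verifying against the Solr version in use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MarcRowSketch {

    /** Hypothetical stand-in for one MARC subfield occurrence. */
    public static final class Subfield {
        public final String tag;
        public final char code;
        public final String value;

        public Subfield(String tag, char code, String value) {
            this.tag = tag;
            this.code = code;
            this.value = value;
        }
    }

    /**
     * Convert a record to a row map: Solr field name -> String for a
     * single value, List<String> for repeated values. fieldSpecs maps a
     * Solr field name to a simplified spec such as "245a" or "650ax".
     */
    public static Map<String, Object> toRow(List<Subfield> record,
                                            Map<String, String> fieldSpecs) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Map.Entry<String, String> spec : fieldSpecs.entrySet()) {
            String tag = spec.getValue().substring(0, 3);
            String codes = spec.getValue().substring(3);
            List<String> values = new ArrayList<>();
            for (Subfield sf : record) {
                if (sf.tag.equals(tag) && codes.indexOf(sf.code) >= 0) {
                    values.add(sf.value);
                }
            }
            if (values.size() == 1) {
                row.put(spec.getKey(), values.get(0));
            } else if (values.size() > 1) {
                row.put(spec.getKey(), values); // multivalued field
            }
        }
        return row;
    }
}
```

Keeping the conversion in one pure function like this is what would let the mapping logic be swapped (for instance, for SolrMarc's own mapping code) without touching the part that hands rows to Solr.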
<jrochkind>still gonna take some work to get there. I'd like to do the work, but my bosses have other priorities. 13:27
* ndushay leaves
<erikhatcher>tuning out for a bit.... l8r13:29
* erikhatcher leaves13:46
* ndushay joins13:54
* erikhatcher joins14:55
* erikhatcher leaves15:10
* bess_away leaves15:16
* bess joins15:19
* cbeer__ leaves16:17
* bess leaves16:33
* ndushay leaves17:01
* ndushay joins17:05
* cbeer_ joins17:07
* jamieorc leaves18:39
* bess joins19:02
* ndushay leaves20:19
* ndushay joins20:20
* ndushay leaves
* bess leaves20:27
* ndushay joins20:50
* Naomi joins20:51
* ndushay leaves20:58
* erikhatcher joins21:11
* erikhatcher leaves21:59
* ndushay leaves22:25
* cbeer__ joins23:25
* cbeer_ leaves23:33