Log of the #blacklight channel on chat.freenode.net

Using timezone: GMT-05:00
* ndushay leaves03:32
* cbeer joins09:22
* bess joins10:04
* MrDys joins10:08
* bess leaves10:23
* jamieorc joins10:33
* erikhatcher leaves11:18
* bess joins11:26
* bess leaves
* bess joins11:28
* erikhatcher joins11:48
* BillDueber joins12:09
* jkeck joins12:13
* ndushay joins13:32
* ndushay leaves13:35
* ndushay joins14:02
* ndushay leaves15:33
* ndushay joins15:35
* bess leaves16:27
<ndushay>erikhatcher: you here?16:48
(or there)
<erikhatcher>i am, but only for a few more
fyi - i plan on getting one of the MARC sets that rsinger pointed me to and giving SolrMarc a try this week to see if i can find out what is causing the indexing slowdown16:49
<ndushay>erikhatcher: ok
i can give you my jar if you like16:50
with all the stanford processing goodies
but we have 999s
i can probably get you our marc data also
just need to find out where to put it
and don't forget - we don't write to index via Solr; we just write to the index directly16:51
<erikhatcher>ndushay: i'll see if i can provide an ftp spot for that stuff
ndushay: well, that writing to the index directly is what i want to change! it's not the scalable way to do it
whatever is going on can be fixed if the actual MARC processing isn't the bottleneck - no reason indexing is taking that long otherwise
<ndushay>erikhatcher: both bob and i are not facile in profiling16:53
erikhatcher: any help you can provide would be awesome
it's about 8G of data for our 6M records
and the index is about 27G
<erikhatcher>ndushay: how many MARC files?
<ndushay>6G
M for million - whoops16:54
<erikhatcher>not how big, how many?
<ndushay>oh
17
<erikhatcher>one issue is if you want to parallelize the indexing and process multiple MARC files at a time, you can't do it with the embedded indexer anyway
at least not without writing the code to be multithreaded yourself
<ndushay>k
<erikhatcher>but with indexing via HTTP you can fire up multiple indexers
i'll take a look at this stuff this week. maybe not tomorrow, but soon, promise16:55
<ndushay>so i would potentially do 17 indexers?
<erikhatcher>this is in my ramp up to c4lcon :)
<ndushay>awesome
<erikhatcher>yup, you could fire up that many indexers probably
<ndushay>let me know if you want me to do any part of the solr black belt workshop
i figure i'll just be in there learning and helping folks when i can.16:56
<erikhatcher>that'd potentially cut indexing time down to 1/17th
<ndushay>yep.
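
[Editor's sketch, not part of the original conversation. It illustrates the idea discussed above: once indexing goes over HTTP rather than through an embedded writer, one indexer can be started per MARC file. The function `index_file`, the file names, and the use of threads are all assumptions for illustration; in practice each indexer would more likely be a separate SolrMarc process pointed at the same Solr server.]

```python
from concurrent.futures import ThreadPoolExecutor


def index_file(path):
    """Hypothetical stand-in for one indexer run.

    A real indexer would parse the MARC file and send its records to
    Solr over HTTP. This placeholder only returns the path it was given.
    """
    return path


def index_all(paths, max_workers=None):
    """Run one indexer per MARC file, at most max_workers at a time."""
    workers = max_workers or max(1, len(paths))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(index_file, paths))


# 17 hypothetical file names, matching the file count mentioned in the log.
marc_files = [f"marc_{i:02d}.mrc" for i in range(1, 18)]
done = index_all(marc_files)
```

[The 1/17th figure is a best case: it assumes the files are of similar size and that the Solr server, not the MARC processing, does not become the new bottleneck.]
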
<erikhatcher>to do any part of it? you're co-teaching it! :) 50/50 dear!
<ndushay>lol
<erikhatcher>no worries... it'll be fun
<ndushay>well, what could i possibly have to say that you wouldn't say better?
i think our marc processing is a lot more involved than uva's16:57
just fyi.
!#$!@#$ call numbers, for one thing.
and 500 flavors of title fields.
<erikhatcher>even UVa's indexer is too slow IMO
anyway, gotta run, more later16:58
* erikhatcher leaves
<jrochkind>the marc processing is definitely involved. First step is figuring out how much of your/our time is SolrMarc processing, and how much is Solr indexing itself. 17:03
If a bunch of it is SolrMarc, then we can probably find bottlenecks to optimize in SolrMarc. But, yeah, there's gonna be a lot of processing no matter what. 17:04
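
[Editor's sketch, not part of the original conversation. It shows one simple way to take the first step jrochkind describes: time the record-processing stage and the index-writing stage separately and compare their shares. `parse_records` and `send_to_solr` are hypothetical placeholders, not SolrMarc or Solr APIs.]

```python
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def parse_records(raw_records):
    """Hypothetical stand-in for the MARC-to-document processing stage."""
    return [record.upper() for record in raw_records]


def send_to_solr(docs):
    """Hypothetical stand-in for the index-writing stage; returns a count."""
    return len(docs)


raw = ["rec1", "rec2", "rec3"]
docs, parse_secs = timed(parse_records, raw)
count, index_secs = timed(send_to_solr, docs)

total = parse_secs + index_secs
# Fraction of measured time spent in processing rather than indexing.
parse_share = parse_secs / total if total else 0.0
```

[If `parse_share` is high on a real run, the bottleneck is in the record processing; if it is low, the time is going into the index writes.]
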
* ndushay leaves18:07
* ndushay joins18:47
* cbeer_ joins18:51
* ndushay leaves19:31
* jkeck leaves20:04
* ndushay joins21:00
* erikhatcher joins21:18
* bess joins21:24
* bess leaves21:32
* bess joins21:52
* bess leaves23:54

Generated by Sualtam