From: Cornelia Plott <c.plott@fz-juelich.de>
Subject: Re: runtime experience for bibrank citation calculation?
To: Tibor Simko <tibor.simko@cern.ch>
Cc: "project-cdsware-users@cern.ch" <project-cdsware-users@cern.ch>,
    "Haustein, Stefanie" <s.haustein@fz-juelich.de>,
    "Tunger, Dirk" <d.tunger@fz-juelich.de>,
    "Holzke, Christoph" <c.holzke@fz-juelich.de>
Date: Mon, 28 Feb 2011 10:44:54 +0100
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.13) Gecko/20101207 Lightning/1.0b2 Thunderbird/3.1.7

Hi Tibor,

Thanks for your answer and your hints.

>> We have loaded and indexed about 1.8 million records into our only local
>> open Invenio instance. Normally the records have a large reference
>> block (like below).
> Do you know how many citer-citee pairs do your records generate? How
> many references do you have in total for these 1.8M records? Do
> references usually refer to other existing records in your system, or do
> they refer to outside records that you do not store?

In total, these 1.8 million records contain 41.6 million references. Yes,
some references point to outside records that we do not store. We do not
know how many citer-citee pairs our records will generate.
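
As a rough orientation, the two totals above already give an average and an upper bound. The sketch below is only back-of-the-envelope arithmetic; the variable names are illustrative, and the upper bound assumes that each reference matches at most one stored record.

```python
# Totals quoted in this message; names are illustrative only.
records = 1_800_000
references = 41_600_000

# Average size of a reference block.
refs_per_record = references / records
print(f"average references per record: {refs_per_record:.1f}")

# Assumption: a reference produces a citer-citee pair only if it resolves
# to a stored record, and it resolves to at most one such record. Under that
# assumption the reference total is an upper bound on the number of pairs.
max_pairs = references
print(f"upper bound on citer-citee pairs: {max_pairs}")
```

The real number of pairs will be lower, since references to outside records produce no pair.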

>> 2011-02-22 03:03:39 --> d_report_numbers done 0 of 15000
>> 2011-02-23 10:14:24 --> d_report_numbers done fully
> Citation ranking method works with big citation dictionaries that are
> usually held in memory. Do you have enough RAM on your box to hold
> them, or did your box start to swap perhaps? Have you tuned your MySQL
> DB settings and do you have large enough max_allowed_packet and friends
> in your /etc/my.cnf?

This Invenio instance does not run on a virtual machine and has 16 GB of
physical RAM.

MemTotal:     16627700 kB
MemFree:       5153924 kB
Buffers:        327200 kB
Cached:        9792016 kB
SwapCached:          0 kB
Active:        2613668 kB
Inactive:      8401064 kB
HighTotal:    15854912 kB
HighFree:      5144616 kB
LowTotal:       772788 kB
LowFree:          9308 kB
SwapTotal:     5144568 kB
SwapFree:      5144476 kB
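
To relate this listing to the question about swapping, the counters can be subtracted directly. The following is a minimal sketch: the helper name `parse_meminfo` is made up for illustration, and treating MemFree + Buffers + Cached as "readily available" memory is a common rule of thumb, not an exact figure. The values reflect only the moment the snapshot was taken.

```python
# Subset of the listing above, as "Key: value kB" lines.
meminfo_text = """\
MemTotal: 16627700 kB
MemFree: 5153924 kB
Buffers: 327200 kB
Cached: 9792016 kB
SwapTotal: 5144568 kB
SwapFree: 5144476 kB
"""


def parse_meminfo(text):
    """Return a dict mapping each counter name to its value in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        values[key.strip()] = int(rest.split()[0])
    return values


info = parse_meminfo(meminfo_text)

# Swap in use at the time of the snapshot.
swap_used_kb = info["SwapTotal"] - info["SwapFree"]

# Rule-of-thumb estimate: free memory plus page cache and buffers.
available_kb = info["MemFree"] + info["Buffers"] + info["Cached"]

print(f"swap used: {swap_used_kb} kB")
print(f"free + buffers + cached: {available_kb} kB")
```

On a live Linux system the same helper could be fed the contents of /proc/meminfo instead of the pasted text.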

We have also tuned our MySQL settings as follows:
[mysqld]
...
#key_buffer = 384M
key_buffer = 2G
#key_buffer_size = 2M
key_buffer_size = 512M
max_allowed_packet = 16M
table_cache = 512
#sort_buffer_size = 2M
sort_buffer_size = 16M
#read_buffer_size = 2M
read_buffer_size = 64M
#read_rnd_buffer_size = 8M
read_rnd_buffer_size = 128M
#myisam_sort_buffer_size = 64M
myisam_sort_buffer_size = 256M
thread_cache_size = 8
query_cache_size = 32M
...

We changed the settings following the recommendations in Baron Schwartz's
"High Performance MySQL: Optimization, Backups, Replication, and More". We
did not change max_allowed_packet. What would be a good size?
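
To make the sizes in the listing comparable, the K/M/G suffixes can be converted to bytes. This is only an illustrative sketch: the helper `to_bytes` is a made-up name, and the sum below is a worst case that assumes a single connection allocates all three per-session buffers (sort, read, and random-read) in full at the same time. It does not imply any recommended value for max_allowed_packet.

```python
# MySQL size suffixes are binary multiples.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(size):
    """Convert a my.cnf style size such as '16M' to a number of bytes."""
    size = size.strip().upper()
    if size[-1] in UNITS:
        return int(size[:-1]) * UNITS[size[-1]]
    return int(size)


# Values copied from the [mysqld] listing above.
per_session = {
    "sort_buffer_size": "16M",
    "read_buffer_size": "64M",
    "read_rnd_buffer_size": "128M",
}

# Assumed worst case: one connection uses all three buffers in full.
per_session_total = sum(to_bytes(v) for v in per_session.values())
max_packet = to_bytes("16M")

print(f"per-session buffers, worst case: {per_session_total // 1024 ** 2} MiB")
print(f"max_allowed_packet: {max_packet} bytes")
```

Multiplying the worst-case figure by the number of concurrent connections gives a rough ceiling to compare against the 16 GB of RAM.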

> Moreover, it would be helpful if you could also run bibrank for say ~100
> sample records via Python profiler so that we'd know where the inside
> bottlenecks are. Here is an example of how to submit such a profiled
> bibrank task:
>
Here is the result of our profiled bibrank task:

./bibrank -u admin -w citation -a -i 1-100 --profile=t

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.029   12.029 bibtask.py:755(_task_run)
        1    0.000    0.000   12.025   12.025 bibrank.py:128(task_run_core)
        1    0.000    0.000   12.025   12.025 bibrank_tag_based_indexer.py:482(citation)
        1    0.043    0.043   12.025   12.025 bibrank_tag_based_indexer.py:329(bibrank_engine)
        1    0.016    0.016   11.737   11.737 bibrank_tag_based_indexer.py:86(citation_exec)
        1    0.001    0.001   11.656   11.656 bibrank_citation_indexer.py:60(get_citation_weight)
        1    0.118    0.118   11.310   11.310 bibrank_citation_indexer.py:570(ref_analyzer)
    17303    0.330    0.000    9.560    0.001 dbquery.py:121(run_sql)
     2141    0.021    0.000    9.300    0.004 search_engine.py:1988(search_unit)
    17303    0.545    0.000    8.506    0.000 cursors.py:127(execute)
     1360    0.059    0.000    8.293    0.006 search_engine.py:2032(search_unit_in_bibwords)
    17303    0.099    0.000    7.480    0.000 cursors.py:308(_query)
    17303    6.400    0.000    7.045    0.000 cursors.py:270(_do_query)
     2725    0.011    0.000    6.139    0.002 data_cacher.py:71(recreate_cache_if_needed)
     2720    0.012    0.000    6.130    0.002 search_engine.py:320(get_index_stemming_language)
     2729    0.056    0.000    6.117    0.002 dbquery.py:256(get_table_update_time)
     2720    0.011    0.000    6.108    0.002 search_engine.py:310(timestamp_verifier)
     6193    0.083    0.000    2.499    0.000 search_engine.py:536(get_index_id_from_field)
        8    0.001    0.000    1.186    0.148 bibrank_citation_indexer.py:947(insert_into_cit_db)
      892    0.005    0.000    1.044    0.001 bibrank_citation_indexer.py:47(__call__)
      781    0.015    0.000    1.039    0.001 bibrank_citation_indexer.py:54(get_recids_matching_query)
      782    0.023    0.000    1.023    0.001 search_engine.py:1726(search_pattern)
        9    0.936    0.104    0.936    0.104 dbquery.py:315(serialize_via_marshal)
      666    0.025    0.000    0.725    0.001 search_engine.py:2091(search_unit_in_idxphrases)
    17287    0.353    0.000    0.608    0.000 cursors.py:105(_do_get_result)
     2113    0.028    0.000    0.575    0.000 bibrank_citation_indexer.py:997(insert_into_missing)
    17303    0.074    0.000    0.481    0.000 cursors.py:55(__del__)
    17303    0.086    0.000    0.408    0.000 cursors.py:60(close)
        1    0.000    0.000    0.398    0.398 bibrank_citation_indexer.py:921(insert_cit_ref_list_intodb)
...
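
To read the table more easily, the cumulative times can be expressed as shares of the whole run. The sketch below only reuses numbers from the listing; the dictionary layout is illustrative. Cumulative times are nested (for example, `run_sql` is called from inside `search_unit`), so the shares overlap and do not add up to 100%. The projection at the end is a naive linear extrapolation that assumes the range 1-100 contains 100 existing records and that the cost per record stays constant, which may well not hold for the full data set.

```python
# Total cumulative time of the profiled run (bibtask.py:755(_task_run)).
total_seconds = 12.029

# (number of calls, cumulative seconds), copied from the listing above.
rows = {
    "run_sql": (17303, 9.560),
    "search_unit": (2141, 9.300),
    "get_table_update_time": (2729, 6.117),
    "serialize_via_marshal": (9, 0.936),
}

# Share of the whole run spent in each function, including its callees.
shares = {name: cum / total_seconds for name, (_calls, cum) in rows.items()}
for name, share in shares.items():
    print(f"{name:24s} {share * 100:5.1f}% of the run")

# Assumption: the range 1-100 corresponds to 100 existing records.
sample_records = 100
sql_calls_per_record = rows["run_sql"][0] / sample_records

# Assumption: cost per record stays constant for all 1.8 million records.
all_records = 1_800_000
projected_hours = total_seconds / sample_records * all_records / 3600

print(f"SQL calls per record: {sql_calls_per_record:.0f}")
print(f"naive projection for all records: {projected_hours:.1f} hours")
```

Read this way, most of the run is spent issuing SQL queries, and roughly half of it passes through `get_table_update_time`.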

Do you already have any optimisation hints, or do you need more
information about our system?

Thanks and kind regards,
Cornelia

Cornelia Plott
Zentralbibliothek
Forschungszentrum Jülich
D-52425 Jülich
GERMANY

Tel: ++49-2461-616206
Email: c.plott@fz-juelich.de
Web: http://www.fz-juelich.de/zb