http://intro2libsys.info/pycon-jp-2015

Experimental Results

Redis KEYS Experiment

The test set for most of these experiments uses RDF BIBFRAME graphs transformed from the MARC records of Colorado College's most popular materials using the Library of Congress's marc2bibframe conversion program. Some properties of this sample set are:

MARC Records    Triples      Works     Instances    Size (MB)
17,144          2,539,234    33,535    17,144       625.05

To test the raw performance of retrieving a single RDF graph for a particular work, we'll use the Redis KEYS command, which supports glob pattern matching across the entire datastore. The code snippet below uses Python's timeit function to simulate a linear increase in usage, running the command from 1 to 10 times, and the results can be graphed with Matplotlib.
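For orientation, here is a minimal sketch of the triple-key layout that the glob pattern relies on, assuming each key is the colon-joined SHA1 digests of subject, predicate, and object (the pattern described in the Redis Hashes section below); the helper names and the example subject IRI are illustrative only:

import hashlib

def sha1(term):
    # SHA1 hex digest of a single triple term
    return hashlib.sha1(term.encode()).hexdigest()

def triple_key(subject, predicate, obj):
    # Assumed layout: subject-sha1:predicate-sha1:object-sha1
    return "{}:{}:{}".format(sha1(subject), sha1(predicate), sha1(obj))

# Matching "<subject-sha1>:*:*" with KEYS retrieves every triple for one subject
pattern = "{}:*:*".format(sha1("http://catalog.coloradocollege.edu/58052458"))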

Issues

  • The Redis KEYS command is O(n), so as the size of the data set increases, the performance of the command degrades in linear (or worse) fashion.
  • The KEYS command also blocks the Redis server from receiving any additional commands, so its use is not advised in production systems.
  • No de-duplication of subjects; i.e. if work 1 is generated by converting a MARC record for copy 1 of Pride and Prejudice, and work 2 is generated from the MARC record for copy 2, we duplicate the triples even though the two records describe essentially the same work.
     
import timeit

# Setup executed once per timeit call: open a connection to the local Redis
redis_setup_stmt = """import redis
cc_popular = redis.StrictRedis()"""
# Glob-match every triple key that starts with the work's subject SHA1 digest
redis_stmt = """cc_popular.keys("17e3f5ff4da952ec9d9f850a571a8d9cf5b1ec69:*:*")"""
redis_trials = []
for count in range(1, 11):
    # Time `count` repetitions of the KEYS call
    redis_timer = timeit.timeit(stmt=redis_stmt, setup=redis_setup_stmt, number=count)
    redis_trials.append(redis_timer)

     
     


Testing Falcon API

Now that we have a baseline for the raw performance of the KEYS command, we'll run the same test, but this time use the requests library to connect to a running Falcon API instance to see how much overhead adding a REST service in front of Redis introduces.

    
# Setup for the timed statement: only the requests library is needed
setup_stmt = """import requests"""
# Ask the Falcon API (listening on port 18150) for one work's subject IRI
falcon_stmt = """requests.get("http://localhost:18150", data={"s": "http://catalog.coloradocollege.edu/58052458"})"""
falcon_trials = []
for count in range(1, 11):
    falcon_timer = timeit.timeit(stmt=falcon_stmt, setup=setup_stmt, number=count)
    falcon_trials.append(falcon_timer)
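The Falcon service itself isn't listed here; below is a minimal sketch of a Falcon 1.x-era resource that could answer this request. The class name, route, SHA1 lookup, and JSON payload are assumptions for illustration, not the actual API code, and it would need to be served on port 18150 by a WSGI server such as gunicorn. Note also that req.get_param reads the query string, whereas the timing snippet above sends "s" in the request body.

import hashlib
import json

import falcon
import redis

cache_redis = redis.StrictRedis()

class TripleSearch:

    def on_get(self, req, resp):
        # Look up every triple key whose subject digest matches the requested IRI
        subject = req.get_param("s") or ""
        digest = hashlib.sha1(subject.encode()).hexdigest()
        keys = cache_redis.keys("{}:*:*".format(digest))
        resp.body = json.dumps({"subject": subject, "triples": len(keys)})

api = falcon.API()
api.add_route("/", TripleSearch())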
    
    


Testing Asynco

Here is the test setup code for the Asynco server:

    
setup_stmt = """import requests"""
# Ask the Asynco server (listening on port 7000) for the same subject IRI
asynco_stmt = """requests.get("http://localhost:7000?s=http://catalog.coloradocollege.edu/58052458")"""
asynco_trials = []
for count in range(1, 11):
    asynco_timer = timeit.timeit(stmt=asynco_stmt, setup=setup_stmt, number=count)
    asynco_trials.append(asynco_timer)
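As with Falcon, the Asynco server itself isn't listed here; below is a minimal asyncio-based sketch using aiohttp that answers on port 7000. aiohttp, the handler name, and the JSON payload are assumptions about the implementation, not the actual server code.

import hashlib

import redis
from aiohttp import web

cache_redis = redis.StrictRedis()

async def triples(request):
    # Read the subject IRI from the query string, e.g. ?s=http://...
    subject = request.query.get("s", "")
    digest = hashlib.sha1(subject.encode()).hexdigest()
    # Blocking Redis call: acceptable for a sketch, not for a real async server
    keys = cache_redis.keys("{}:*:*".format(digest))
    return web.json_response({"subject": subject, "triples": len(keys)})

app = web.Application()
app.router.add_get("/", triples)
web.run_app(app, port=7000)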
    
    

# of Runs    Direct Redis          Falcon API            Asynco
1            1.2281699029845186    1.2685371820116416    1.3175828389939852
2            2.6082807540078647    2.768362994014751     3.0104295710334554
3            3.5183271229616366    3.5287525659659877    3.708122154988814
4            4.681248573004268     4.727095118025318     4.8970251709688455
5            5.818655456008855     5.897049093968235     6.1605203920044005
6            7.248136567999609     7.0755964029813185    7.360615084005985
7            8.514002248994075     8.32872693700483      8.533214915019926
8            9.544208430976141     9.583168470009696     9.845916612015571
9            10.48881931899814     10.6294925850234      11.09655712498352
10           13.434723928978201    13.046950068033766    12.24355677596759
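Since the first experiment mentions graphing the results with Matplotlib, here is a minimal plotting sketch, assuming the redis_trials, falcon_trials, and asynco_trials lists from the snippets above are still in memory:

import matplotlib.pyplot as plt

runs = list(range(1, 11))
# One line per backend: execution time against the number of timeit runs
plt.plot(runs, redis_trials, label="Direct Redis")
plt.plot(runs, falcon_trials, label="Falcon API")
plt.plot(runs, asynco_trials, label="Asynco")
plt.xlabel("# of runs")
plt.ylabel("Execution time (seconds)")
plt.legend()
plt.show()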

Interpreting

From this first pass, we see that regardless of the test, as the number of trials increased, our execution time increased in a roughly linear fashion. This is encouraging and what we would expect. Second, neither Falcon nor the Asynco HTTP server added much overhead to the execution of the code. Finally, there are no appreciable differences between our Falcon and Asynco server code, making the choice between them more a matter of style and ease of maintenance than of performance.

We should be careful about drawing too many conclusions from this test run; it should be repeated a few more times to make sure the results stay consistent over time. But for now, we'll look at how we can improve the Redis backend implementation to use something other than the KEYS command before we start trying to optimize which server technology to use: Falcon or Asynco.


Redis SCAN Experiment

As mentioned in the issues section of our first experiment, using the Redis KEYS command is discouraged in production. Redis provides a non-blocking alternative called SCAN that has O(1) time complexity per call. The SCAN command takes a cursor and performs one iteration per call; calling SCAN repeatedly eventually iterates through the entire Redis key space. The amount of the key space examined per call can be adjusted with the COUNT parameter.

In the test setup code below, we keep the number of runs constant at 10 but vary the size of the COUNT parameter. Remember from our last test that 10 runs of the KEYS command with a direct Redis call took roughly 13 seconds.


# Setup defines a helper that walks the full key space with the SCAN cursor
redis_setup = """import redis
cache_redis = redis.StrictRedis()
def test_redis_scan(count):
    cur = 0
    output = []
    while 1:
        # Each call advances the cursor and returns roughly `count` keys to check
        cur, results = cache_redis.scan(cur, match="17e3f5ff4da952ec9d9f850a571a8d9cf5b1ec69:*:*", count=count)
        if len(results) > 0:
            output.extend(results)
        if not cur:
            # A cursor of 0 means the iteration over the key space is complete
            break
    return output"""
scan_trials = []
# Hold the number of runs at 10 and vary the SCAN COUNT hint
for count in [1000, 2500, 5000, 10000, 25000, 50000, 75000, 100000, 250000, 500000]:
    redis_stmt = "test_redis_scan({})".format(count)
    redis_timer = timeit.timeit(stmt=redis_stmt, setup=redis_setup, number=10)
    scan_trials.append((count, redis_timer))
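As an aside, redis-py also exposes a scan_iter() generator that wraps this cursor loop; a one-line equivalent of the helper above (the COUNT value here is arbitrary):

import redis

cache_redis = redis.StrictRedis()
# scan_iter() advances the cursor internally and yields matching keys
matched = list(cache_redis.scan_iter(
    match="17e3f5ff4da952ec9d9f850a571a8d9cf5b1ec69:*:*", count=1000))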

Count      Time (seconds)
1000       93.26906220900128
2500       102.64958874002332
5000       96.19367762497859
10000      94.76345123304054
25000      91.95245929301018
50000      97.29142061399762
75000      90.68025794799905
100000     91.00262290996034
250000     96.62939810799435
500000     88.64146210596664

Yikes!

Even the best result, a COUNT size of 500,000 over 10 trials, was 88 seconds, significantly worse than our KEYS test from above. Are there any alternative approaches we can examine? Yes! One of the nice features of Redis is that we can use a different data structure to improve performance.


Using Redis Hashes

An alternative to constructing a Redis triple key from the three different SHA1 digests of each subject-predicate-object triple is to use Redis hashes to store triple information. A first pass at this approach stores all of a subject's predicate-objects in a hash whose key carries the suffix :predicate-objects. In this model the hash fields still use the "predicate-sha1-digest:object-sha1-digest" pattern, with the field value being the integer 1. For the subject used in our example, the companion Redis key would be:
17e3f5ff4da952ec9d9f850a571a8d9cf5b1ec69:predicate-objects
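Here is a sketch of how a single triple could be written into this hash layout; the sha1 and add_triple helpers are illustrations of the data model described above, not the actual loader code:

import hashlib

import redis

cache_redis = redis.StrictRedis()

def sha1(term):
    # SHA1 hex digest of a triple term
    return hashlib.sha1(term.encode()).hexdigest()

def add_triple(subject, predicate, obj):
    # All predicate-object pairs for a subject live in one hash keyed by
    # the subject digest plus the :predicate-objects suffix
    key = "{}:predicate-objects".format(sha1(subject))
    field = "{}:{}".format(sha1(predicate), sha1(obj))
    cache_redis.hset(key, field, 1)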


>>> local_redis.hgetall('17e3f5ff4da952ec9d9f850a571a8d9cf5b1ec69:predicate-objects')
{b'f610f749c5c2eaf6718eb2bc24bf74559d14637d:1509401ff5fecad20101ebb147f77d82407b75ae': b'1', 
 b'a548a25005963f85daa1215ad90f7f1a97fbe749:899c924e4dc30d79cb6d11b5d03542e1e3d8c381': b'1', 
 b'a20301af19937f3787275c059dae953eaff2cb5f:262798fe370ba8f32d0aa9b40381c1cab4b8ae22': b'1', 
 b'a548a25005963f85daa1215ad90f7f1a97fbe749:8eda00df792b5ab31f14febf8fbae9f78bf820e7': b'1', 
 b'3c197cb1f6842dc41aa48dc8b9032284bcf39a27:5d1377f4476a1cbfb3caea106dc6b0a7d086410a': b'1', 
 b'0ba43b890c824baa68a0d304bbd0baab6bff921a:9c1020be297eed7dc8106954d1600ecd68448a64': b'1', 
 b'38e711cd1701b5f99ae70f2c9974aae8a24944d1:23b04598de49ffeeddf305afbdea4d7fa3127299': b'1', 
 b'0f08c96e756a4fa720257bf3090efdf76b5d3acc:0672781095db8ffcb1977d58687125df74c3e5c7': b'1', 
 b'1800d68bd2be5785c5505e09329f3d37b02c9d01:978170a2adcba8e0f7c41158c420d642110cd59e': b'1', 
 b'3c197cb1f6842dc41aa48dc8b9032284bcf39a27:0139c97cff6054f868eadb4b0e70a58da37238bd': b'1'}

Now we'll run this test code and see what the results are:


redis_setup = """import redis
cache_redis = redis.StrictRedis()"""
# Fetch every predicate-object field for the subject's hash in a single call
redis_stmt = """cache_redis.hgetall('17e3f5ff4da952ec9d9f850a571a8d9cf5b1ec69:predicate-objects')"""
redis_hash_trials = []
for count in range(1, 11):
    hash_timer = timeit.timeit(stmt=redis_stmt, setup=redis_setup, number=count)
    redis_hash_trials.append(hash_timer)

# of Runs    Time in Seconds
1            0.00512834801338613
2            0.002946369000710547
3            0.002188334008678794
4            0.0025909639662131667
5            0.0024501450243405998
6            0.0027079840074293315
7            0.0031070280238054693
8            0.003147128038108349
9            0.0033804820268414915
10           0.004014718986582011

Even running the standard timeit number of 1,000,000 runs results in respectable performance using hashes:

>>> hash_timer =  timeit.timeit(stmt=redis_stmt, setup=redis_setup)
>>> print(hash_timer)
243.62794216600014
