Query 100 GB of S3 data in milliseconds

0

I have JSON data in S3. A record looks like this:

{

    "act_timestamp": 1576480759864,
    "action": 26,
    "cmd_line": "\\??\\C:\\Windows\\system32\\conhost.exe 0xffffffff",
    "guid": "45af94911fb911ea827300270e098ff0",
    "md5": "d5669294f78a7d48c318ef22d5685ba7",
    "name": "conhost.exe",
    "path": "C:\\Windows\\System32\\conhost.exe",
    "pid": 1968,
    "sha2": "6bd1f5ab9250206ab3836529299055e272ecaa35a72cbd0230cb20ff1cc30902",
    "proc_id": "45af94901fb911ea827300270e098ff0",
    "proc_name": "gcxvdf.exe"
}

I have around 100 GB of such JSON records stored in S3, in a folder structure like year/month/day/hour. I need to query this data and get results in milliseconds. Queries can look like:

select proc_id where name='conhost.exe',
select proc_id where cmd_line contains 'conhost.exe'.

I tried AWS Athena and Redshift, but both return results in around 10-20 seconds. I even tried the Parquet and ORC file formats.

Is there any tool, technology, or technique that can query this kind of data and return results in milliseconds?

(The response time needs to be in milliseconds because I am developing an interactive application.)

amazon-web-services
amazon-s3
amazon-redshift
amazon-athena
asked on Stack Overflow Dec 16, 2019 by AbhiK • edited Dec 16, 2019 by a_horse_with_no_name

2 Answers

1

I think you are looking for a distributed search system like Solr or Elasticsearch (I am sure there are others, but those are the ones I am familiar with).

It is also worth considering whether you can reduce your data size at all. Is there any old or stale data in your 100 GB?
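
To make the search-system suggestion a bit more concrete, here is a rough sketch of what the two lookups from the question might look like as Elasticsearch search request bodies. It only builds and prints the JSON, so it runs without a cluster. The function names are made up for illustration, and it assumes the records were indexed with `name` mapped as a `keyword` field and `cmd_line` as a `keyword` (or `wildcard`) field; neither mapping comes from the question.

```python
import json


def name_equals_query(name):
    """Exact-match lookup; assumes `name` is mapped as a keyword field."""
    return {"_source": ["proc_id"], "query": {"term": {"name": name}}}


def cmd_line_contains_query(fragment):
    """Substring lookup; assumes `cmd_line` is a keyword or wildcard field."""
    # Backslash-escape the wildcard query's special characters so the
    # fragment itself is matched literally.
    escaped = (
        fragment.replace("\\", "\\\\").replace("*", "\\*").replace("?", "\\?")
    )
    return {
        "_source": ["proc_id"],
        "query": {"wildcard": {"cmd_line": {"value": f"*{escaped}*"}}},
    }


# Print the bodies that would be sent to an index's _search endpoint.
print(json.dumps(name_equals_query("conhost.exe"), indent=2))
print(json.dumps(cmd_line_contains_query("conhost.exe"), indent=2))
```

The exact-match query is the kind of lookup an index answers quickly. The "contains" query uses a leading wildcard, which tends to be the more expensive of the two, so it is worth measuring on real data before relying on it.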

answered on Stack Overflow Dec 16, 2019 by user3091170
0

I was able to solve the above use case by using Presto and Hive on AWS EMR.

With Hive we can create a table over the data in S3, and by using Presto with Hive as the catalog we can query this data. I found that Presto on EMR is much faster than AWS Athena (which is strange, since Athena uses Presto internally).

Create the table in Hive:

        CREATE EXTERNAL TABLE `test_table`(
          `field_name1` datatype,
          `field_name2` datatype,
          `field_name3` datatype
        )
        STORED AS ORC
        LOCATION
          's3://test_data/data/';

Query this table in Presto:

        > presto-cli --catalog hive
        > select field_name1 from test_table limit 5;
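
For the two lookups in the question, the Presto statements might look roughly like the ones this small sketch prints. It assumes the table was created with columns named after the JSON keys (`name`, `cmd_line`, `proc_id`) rather than the `field_nameN` placeholders above, and the helper names are hypothetical. It only builds SQL text, so it runs without a cluster.

```python
def sql_literal(value):
    """Quote a Python string as a SQL string literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def proc_ids_by_name(table, name):
    """Exact match on the `name` column (assumed column name)."""
    return f"SELECT proc_id FROM {table} WHERE name = {sql_literal(name)}"


def proc_ids_by_cmd_line_fragment(table, fragment):
    """'Contains' match on the `cmd_line` column (assumed column name)."""
    # strpos returns the 1-based position of the substring, or 0 if absent,
    # which avoids having to escape LIKE wildcards in the fragment.
    return (
        f"SELECT proc_id FROM {table} "
        f"WHERE strpos(cmd_line, {sql_literal(fragment)}) > 0"
    )


print(proc_ids_by_name("test_table", "conhost.exe"))
print(proc_ids_by_cmd_line_fragment("test_table", "conhost.exe"))
```

The printed statements can be pasted into the presto-cli session shown above.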
answered on Stack Overflow Jun 30, 2020 by AbhiK

User contributions licensed under CC BY-SA 3.0