Query 100GB of S3 data in milliseconds


I have JSON data in S3. The data looks like this:

{
    "act_timestamp": 1576480759864,
    "action": 26,
    "cmd_line": "\\??\\C:\\Windows\\system32\\conhost.exe 0xffffffff",
    "guid": "45af94911fb911ea827300270e098ff0",
    "md5": "d5669294f78a7d48c318ef22d5685ba7",
    "name": "conhost.exe",
    "path": "C:\\Windows\\System32\\conhost.exe",
    "pid": 1968,
    "sha2": "6bd1f5ab9250206ab3836529299055e272ecaa35a72cbd0230cb20ff1cc30902",
    "proc_id": "45af94901fb911ea827300270e098ff0",
    "proc_name": "gcxvdf.exe"
}

I have around 100GB of such JSON documents stored in S3, in a folder structure like year/month/day/hour. I need to query this data and get results in milliseconds. The queries look like the following (the table name is illustrative):

select proc_id from events where name = 'conhost.exe';
select proc_id from events where cmd_line like '%conhost.exe%';

I tried AWS Athena and Redshift, but both return results in around 10-20 seconds. I even tried the Parquet and ORC file formats.
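For context, my Athena attempt boils down to something like this via boto3 (the database, table, and result-bucket names are placeholders):

import time

import boto3

athena = boto3.client("athena")

# Athena runs queries asynchronously: start the query, poll for
# completion, then fetch the results Athena wrote to S3.
qid = athena.start_query_execution(
    QueryString="select proc_id from events where name = 'conhost.exe'",
    QueryExecutionContext={"Database": "security_logs"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)["QueryExecutionId"]

while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(0.2)

rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

Even with partitioned Parquet data, this start/poll/fetch round trip stays in the 10-20 second range for me, nowhere near milliseconds.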

Is there any tool, technology, or technique that can query this kind of data and return results in milliseconds?

(The response time needs to be in milliseconds because I am developing an interactive application.)

amazon-web-services
amazon-s3
amazon-redshift
amazon-athena
asked on Stack Overflow Dec 16, 2019 by AbhiK • edited Dec 16, 2019 by a_horse_with_no_name

1 Answer


I think you are looking for a distributed search system like Solr or Elasticsearch (I am sure there are others, but those are the ones I am familiar with).
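As a minimal sketch of what that could look like with Elasticsearch and its Python client, 8.x-style (the endpoint, index name, and mapping below are assumptions, not a drop-in setup):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster endpoint

# Map exact-match fields as `keyword` and cmd_line as analyzed `text`,
# so equality lookups stay exact and "contains"-style matches are possible.
es.indices.create(
    index="events",  # illustrative index name
    mappings={
        "properties": {
            "act_timestamp": {"type": "date", "format": "epoch_millis"},
            "name": {"type": "keyword"},
            "proc_id": {"type": "keyword"},
            "cmd_line": {"type": "text"},
        }
    },
)

# Equivalent of: select proc_id where name = 'conhost.exe'
by_name = es.search(
    index="events",
    query={"term": {"name": "conhost.exe"}},
    source=["proc_id"],
)

# Equivalent of: select proc_id where cmd_line contains 'conhost.exe'
by_cmd = es.search(
    index="events",
    query={"match": {"cmd_line": "conhost.exe"}},
    source=["proc_id"],
)

for hit in by_cmd["hits"]["hits"]:
    print(hit["_source"]["proc_id"])

Once the documents are indexed (e.g. bulk-loaded from your S3 folders), both lookups hit an inverted index and typically return in milliseconds; for true substring matching on cmd_line, a wildcard query also works, at higher cost.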

It is also worth considering whether you can reduce your data size at all. Is there any old or stale data in your 100GB?

answered on Stack Overflow Dec 16, 2019 by user3091170

User contributions licensed under CC BY-SA 3.0