TechWeek News

11 May 2018

Q&A from ODBMS.org with Thomas Kalippke: Analysing a dataset of 1.4 billion records with CortexDB.

Cortex AG

Q1. Your current showcase is a database of 1.4 billion records (~1.2 TB data file) covering New York taxi trips, built from the TLC Trip Record Data. What kind of data discovery do you do with such data?

That’s right. We use the data from all taxi rides recorded in New York over nine years. So far we have not built anything custom for this data; we run the data discovery with our standard tools on a laptop (Mac Air) with a USB SSD (Samsung T3, 2 TB).

Our first idea was to use this data to demonstrate individual functions of our database technology and our platform. We knew about other people’s analyses before the import and were considering producing similar graphical and aggregated information. After the import and a first look at the data, we found it much more interesting and exciting to show the distinct values per field and to perform on-the-fly analyses in our 6th normal form.

In the first step, we only looked at the distinct values. For example, incorrect data in each field can be identified with a mouse click, without implementing an algorithm. The data shows that many credit card transactions contain a negative tip. Interestingly, tips are only recorded for credit card payments.
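The idea of inspecting the distinct values of one field can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how CortexDB works internally: the sample records, the field names `payment_type` and `tip_amount`, and the helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical sample records; the field names are assumptions for illustration.
trips = [
    {"payment_type": "card", "tip_amount": 2.0},
    {"payment_type": "card", "tip_amount": -1.5},
    {"payment_type": "cash", "tip_amount": 0.0},
    {"payment_type": "card", "tip_amount": 2.0},
]


def value_counts(records, field):
    """Count how often each distinct value occurs in one field."""
    return Counter(record[field] for record in records)


def suspicious_values(counts, is_valid):
    """Return the distinct values that fail a plausibility check, sorted."""
    return sorted(value for value in counts if not is_valid(value))


tip_counts = value_counts(trips, "tip_amount")
# A tip should never be negative, so anything below zero is flagged.
bad_tips = suspicious_values(tip_counts, lambda value: value >= 0)

print(tip_counts)
print(bad_tips)
```

Looking at the distinct values first, rather than at aggregates, is what makes an outlier such as a negative tip visible at a glance.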

The data also includes the geo-coordinates of the start and end of each trip. From these we could see that some journeys supposedly led once around the earth. Since this pattern recurs, we assume it is a test of the taximeters.
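One way to flag such trips automatically is to compute the great-circle distance between the start and end coordinates and compare it with a plausibility limit. The following Python sketch uses the haversine formula; the field names and the 200 km threshold are assumptions chosen for illustration, not values taken from the showcase.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point rounding before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


MAX_PLAUSIBLE_KM = 200.0  # assumed limit for a city taxi trip


def is_implausible(trip):
    """True if the straight-line distance of a trip exceeds the assumed limit."""
    distance = haversine_km(trip["pickup_lat"], trip["pickup_lon"],
                            trip["dropoff_lat"], trip["dropoff_lon"])
    return distance > MAX_PLAUSIBLE_KM


# Hypothetical records: a normal Manhattan-to-airport trip and a corrupt one.
normal = {"pickup_lat": 40.7580, "pickup_lon": -73.9855,
          "dropoff_lat": 40.6413, "dropoff_lon": -73.7781}
corrupt = {"pickup_lat": 40.7580, "pickup_lon": -73.9855,
           "dropoff_lat": 0.0, "dropoff_lon": 0.0}

print(is_implausible(normal), is_implausible(corrupt))
```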

After examining the individual fields and their contents, we proceeded combinatorially and manually picked a few records at random. That is, we used our standard search to combine result sets and test whether correlations can be recognized. This led us to the idea of using paparazzi pictures of celebrities, which told us who got into a cab, where and when. Combining this information made it possible to find the exact trip in the transaction data.

Q2. What technologies do you use for data discovery and analytics?

We use only our own technology for this showcase. CortexDB forms the core, and the tools are all combined under the CortexPlatform. Tools from other vendors could be connected via our APIs, but we want to show what is possible with the standard version on a laptop with a USB disk. Of course, other developers can also use other tools on our database platform, and we are very curious to see what comes out and how other people approach this data.

The data resides entirely on an external USB SSD, and we use it on a Mac Air laptop. However, we did the import on a small server with 16 cores and 64 GB RAM: the import took about 3 hours, and the reorganization into the 6th normal form took another 4 hours. In contrast to other solutions based on this data, we can select all contents of all fields of all records in any combination and browse through them relatively playfully.

CortexDB is what lets us work playfully with this data on a laptop: it combines a document store for record storage with a key/value store that indexes each field's content (6th normal form). We thus combine the flexibility of a schemaless approach with a defined schema for a redundancy-free index per field content.
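The general idea of pairing a document store with an index per field content can be illustrated with a small in-memory Python sketch. It is not CortexDB's actual implementation or API; the class `FieldIndexedStore`, its methods and the sample fields are hypothetical and only show why arbitrary field combinations become cheap to select.

```python
from collections import defaultdict


class FieldIndexedStore:
    """Toy document store with one value index per field (illustration only)."""

    def __init__(self):
        self.documents = {}  # record id -> schemaless document
        # field name -> field value -> set of record ids containing that value
        self.index = defaultdict(lambda: defaultdict(set))

    def insert(self, doc_id, doc):
        """Store the document and register every field value in the index."""
        self.documents[doc_id] = doc
        for field, value in doc.items():
            self.index[field][value].add(doc_id)

    def distinct_values(self, field):
        """All values that actually occur in a field, each listed once."""
        return sorted(self.index[field])

    def select(self, **criteria):
        """Return documents matching all field=value pairs via set intersection."""
        id_sets = [self.index[f].get(v, set()) for f, v in criteria.items()]
        ids = set.intersection(*id_sets) if id_sets else set(self.documents)
        return [self.documents[i] for i in sorted(ids)]


store = FieldIndexedStore()
store.insert(1, {"passengers": 2, "payment": "card"})
store.insert(2, {"passengers": 1, "payment": "cash"})
store.insert(3, {"passengers": 2, "payment": "cash", "note": "airport"})

print(store.distinct_values("payment"))
print(store.select(passengers=2, payment="cash"))
```

Each value is stored once per field in the index, and a combined query is just an intersection of the id sets behind the chosen values, so documents with differing fields can coexist without a fixed record schema.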

For data discovery we use our web-based application CortexUniplex in combination with server-side JavaScript. We can go through the data manually, but we can also run scripts to check values.

For the analysis we use the integrated functions to analyze the contents of each field (e.g. we immediately see that 80% of all tips were $1, $2 or $3). We also use another of our tools to create aggregated information and filter it as desired: the so-called pivot server, with which we calculate the results for any combination and display them graphically. Each result knows its source records in the transaction data. If a transaction changes, only the affected results are regenerated.
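The principle of regenerating only the affected results can be sketched as follows. This Python example is an assumption about how such incremental aggregation might look in general, not a description of the pivot server itself; the class `IncrementalPivot` and the fields `weekday`, `passengers` and `tip` are hypothetical.

```python
from collections import defaultdict


class IncrementalPivot:
    """Keeps sum and count per group and touches only the groups that change."""

    def __init__(self, key_fields, value_field):
        self.key_fields = key_fields
        self.value_field = value_field
        self.cells = defaultdict(lambda: [0.0, 0])  # group key -> [sum, count]
        self.records = {}  # record id -> last known version of the record

    def _key(self, record):
        return tuple(record[f] for f in self.key_fields)

    def upsert(self, rec_id, record):
        """Insert or change a record; return the set of group keys affected."""
        touched = set()
        old = self.records.get(rec_id)
        if old is not None:
            # Remove the old contribution from the group it belonged to.
            old_cell = self.cells[self._key(old)]
            old_cell[0] -= old[self.value_field]
            old_cell[1] -= 1
            touched.add(self._key(old))
        new_cell = self.cells[self._key(record)]
        new_cell[0] += record[self.value_field]
        new_cell[1] += 1
        touched.add(self._key(record))
        self.records[rec_id] = dict(record)
        return touched

    def mean(self, key):
        """Average value of one group, or None if the group is empty."""
        total, count = self.cells.get(key, (0.0, 0))
        return total / count if count else None


pivot = IncrementalPivot(("weekday", "passengers"), "tip")
pivot.upsert(1, {"weekday": "Sat", "passengers": 2, "tip": 3.0})
pivot.upsert(2, {"weekday": "Sat", "passengers": 2, "tip": 1.0})
pivot.upsert(3, {"weekday": "Mon", "passengers": 1, "tip": 2.0})

# Changing one transaction only affects the group it belongs to.
touched = pivot.upsert(2, {"weekday": "Sat", "passengers": 2, "tip": 5.0})
print(touched, pivot.mean(("Sat", 2)), pivot.mean(("Mon", 1)))
```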

For the graphical representation we use the D3.js library. In this showcase it may not look as polished as solutions from other vendors, but we just want to show the feasibility of new approaches that can be opened up to other developers and departments.

Q3. What results did you obtain from such analysis?

Because we know and can show the contents of every field in every record, our recommendation is that everyone identify possible sources of error and exclude them from aggregated results (data discovery). For example, if the average of all tips is calculated with the negative values included, the result can only be wrong.
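A tiny worked example shows how much a single invalid value can distort an average. The tip amounts below are made up for illustration and are not taken from the taxi data.

```python
from statistics import mean

# Made-up tip amounts in dollars; the negative entry stands for a data error.
tips = [2.0, 3.0, -4.0, 1.0, 2.0]

naive_average = mean(tips)                 # includes the invalid value
valid_tips = [t for t in tips if t >= 0]   # exclude values that cannot be real
cleaned_average = mean(valid_tips)

print(naive_average, cleaned_average)
```

With the negative value included the average comes out at about $0.80; with it excluded the average is $2.00, which is why the value set has to be checked before aggregating.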

This naturally applies to all analyses in all databases and domains. If you do not know the possible value set and do not know what actually exists, selections are incomplete and results are incorrect. With the CortexPlatform it is very easy to look into the field index (6th normal form) yourself, or to have an algorithm do it, and check the actual values to identify errors.

With regard to the taxi data showcase, it was interesting to see that our own assumptions and the recommendations from travel guides were not correct.

For example, we suspected that most taxi rides take place on certain holidays. This was not true. Most people took a taxi on the days of the New York Marathon and of league games in baseball and American football.

In addition, many travel guides say that a tip of 30% is appropriate. This is also not true: 80% of all tips are $1, $2 or $3. Longer journeys (in both time and distance) obviously earn bigger tips, but a flat rate of 30% is wrong.

Interestingly, we were right about one assumption: the best tips are paid on weekend trips with two passengers. Our guess is that men are more generous when accompanied by women.

Q4. Anything else you wish to add?

We now provide a free version of the CortexPlatform, which can be downloaded from our website. We will soon be adding more tools to the download area, so that everyone will be able to import the same taxi data.

We are also working on publishing examples on GitHub, and we want to exchange ideas and examples with other developers.

Ideas, questions and suggestions are therefore very welcome, and we are happy to help anyone who contacts us directly.
----------

Thomas Kalippke, Product and Partner Management, Cortex AG
