trænz·lū·sĕns | sensing language: word by word


Dec/09

4

Figures Analysis

I’ve done some more calculations and I’m finding that for every 30 minutes of processing, we’re actually reading content from files for roughly 3% of the time (about 1 minute) and writing data to the database for roughly 97% of the time (about 29 minutes). If we take that 1-in-30 share of the total time spent reading and writing (1,187,328 seconds), we get 39,577.6 seconds of reading. Dividing the 253,545,925 relationships parsed so far by that figure puts the actual reading speed at about 6,406 relationships per second.
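The arithmetic above can be reproduced with a short sketch. The totals come from the figures posted on this blog; the 1-in-30 read share is an assumption based on the "about 1 minute out of every 30" observation, and the function and variable names are only illustrative.

```python
TOTAL_SECONDS = 1_187_328          # total processing time reported on this blog
TOTAL_RELATIONSHIPS = 253_545_925  # total relationships parsed so far
READ_FRACTION = 1 / 30             # assumption: ~1 minute of reading per 30 minutes


def read_throughput(total_seconds, total_relationships, read_fraction):
    """Return (seconds spent reading, relationships read per second)."""
    read_seconds = total_seconds * read_fraction
    return read_seconds, total_relationships / read_seconds


read_seconds, rels_per_second = read_throughput(
    TOTAL_SECONDS, TOTAL_RELATIONSHIPS, READ_FRACTION
)
print(f"{read_seconds:,.1f} s reading, {rels_per_second:,.2f} relationships/s")
```

Using a strict 3% instead of 1/30 would give a somewhat shorter read time and a correspondingly higher throughput, so the result is only as precise as the read share estimate.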

The fact that we are writing to the database for 97% of the time suggests that we should try to improve write performance so we can take advantage of the reading speed. Another important fact is that the speed of writing data to the database has not changed since the beginning of this instance, which means that database size isn’t (yet) affecting performance. Ideas that I have so far for increasing performance are to (1) use RAID 0, or (2) use a cluster of databases. Another option might be to optimize the database configuration. I’m not sure how much that would help; it may significantly increase performance, but I’m no DBA, so I can’t say for sure.
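To get a rough feel for how much faster writes would help, here is a minimal sketch using an Amdahl's-law style estimate. It assumes reading and writing happen one after the other (no overlap) and that writing takes 29 of every 30 minutes; both are assumptions, and the function name is hypothetical.

```python
WRITE_FRACTION = 29 / 30  # assumption: share of run time spent writing to the database


def overall_speedup(write_fraction, write_speedup):
    """Estimate the whole-run speedup if only the write phase gets faster.

    Assumes the read and write phases run serially, so the read share
    (1 - write_fraction) is unaffected by any database improvement.
    """
    read_fraction = 1 - write_fraction
    return 1 / (read_fraction + write_fraction / write_speedup)


for factor in (2, 4, 10):
    total = overall_speedup(WRITE_FRACTION, factor)
    print(f"writes {factor}x faster -> whole run about {total:.2f}x faster")
```

Because writing dominates the run so heavily, almost any gain in write speed carries over to the whole run; under these assumptions the ceiling is about 30x, the point at which reading becomes the only cost.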


Dec/09

4

Sobering Figures #2

The following are slightly more optimistic figures regarding the speed at which we are parsing data:

unique relationships: 66,024,755
total relationships parsed: 253,545,925
relationships per second: 213.54
relationships per minute: 12,812.60
relationships per hour: 768,755.84
unique rel’s per second: 55.61
unique rel’s per minute: 3,336.47
unique rel’s per hour: 200,188.25
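These rates follow from dividing each count by the total parsing time of 1,187,328 seconds (taken from the "Sobering Figures #1" post). A minimal sketch, with an illustrative helper name:

```python
TOTAL_SECONDS = 1_187_328           # total seconds of parsing
TOTAL_RELATIONSHIPS = 253_545_925   # total relationships parsed
UNIQUE_RELATIONSHIPS = 66_024_755   # unique relationships


def rates(count, seconds):
    """Return the per-second, per-minute, and per-hour rate for a count."""
    per_second = count / seconds
    return {
        "second": per_second,
        "minute": per_second * 60,
        "hour": per_second * 3600,
    }


total_rates = rates(TOTAL_RELATIONSHIPS, TOTAL_SECONDS)
unique_rates = rates(UNIQUE_RELATIONSHIPS, TOTAL_SECONDS)

for unit in ("second", "minute", "hour"):
    print(f"relationships per {unit}: {total_rates[unit]:,.2f}")
    print(f"unique rel's per {unit}: {unique_rates[unit]:,.2f}")
```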


Dec/09

4

Sobering Figures #1

Here are some sobering figures regarding the parsing of Wikipedia. Such a large corpus of text and so much more to go!

total seconds of parsing: 1187328
total files parsed: 31
total files: 564
files remaining: 533
percent completed: 5.50%
minutes elapsed: 19788.8000
files per minute: 0.0016
minutes per file: 638.3484
minutes till complete: 340239.6903
hours elapsed: 329.8133
files per hour: 0.0940
hours per file: 10.6391
hours till complete: 5670.6615
days elapsed: 13.7422
files per day: 2.2558
days per file: 0.4433
days till complete: 236.2776
weeks till complete: 33.7539
months till complete: 7.8759
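The projection above is a straight-line extrapolation: time per finished file multiplied by the number of files left. A minimal sketch of that calculation follows; it assumes every remaining file takes as long as the average of the 31 parsed so far, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def estimate_remaining(elapsed_seconds, files_done, files_total):
    """Linearly extrapolate the remaining time from the files parsed so far."""
    if files_done <= 0:
        raise ValueError("need at least one finished file to extrapolate")
    seconds_per_file = elapsed_seconds / files_done
    files_remaining = files_total - files_done
    remaining_seconds = seconds_per_file * files_remaining
    return {
        "percent_complete": 100 * files_done / files_total,
        "files_remaining": files_remaining,
        "hours_per_file": seconds_per_file / 3600,
        "days_till_complete": remaining_seconds / SECONDS_PER_DAY,
        "weeks_till_complete": remaining_seconds / SECONDS_PER_DAY / 7,
    }


eta = estimate_remaining(elapsed_seconds=1_187_328, files_done=31, files_total=564)
for name, value in eta.items():
    print(f"{name}: {value:,.4f}")
```

If the files differ a lot in size, weighting by bytes parsed rather than by file count would likely give a better estimate.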


Nov/09

24

‘Unfriend’ named word of the year

Nov/09

20

Challenge: Defining Coherence

Over the past few years I’ve tried to find out whether “coherence” is an appropriate word to use when describing part of TransluSense’s objective. Wikipedia has an article on linguistic coherence, and it seems to back my perception of what coherence means with respect to language and word usage. As one of TransluSense’s objectives is to build a systematic algorithm for “gauging coherence”, the definition of “coherence” must be carefully presented. The Wikipedia article (I think) does it some justice. Simply put: “Coherence in linguistics is what makes a text semantically meaningful.”

For the layperson, I’ve considered a text to be “coherent” if a native speaker of the language would agree that the text “made sense”. Once the native speaker is unable to understand the meaning of a sentence, it is no longer “coherent.” Furthermore, my belief is that there are varying levels of “coherence”: one text can “barely make sense” while another is much clearer. This topic clearly calls for a lot of research and a very specific attempt at defining coherence and determining how gaugeable it in fact is.


Nov/09

20

TransluSense Described

For those of you that are visiting this site for the first time, here is a definition of TransluSense:

TransluSense is a platform of software and information that provides up-to-date information on language usage to third-party applications. The concept is that these third-party applications require more than the traditional built-in grammar tools. They would benefit from obtaining data on the latest language usage patterns that supersede grammar rules and focus on colloquial language usage and actual word order patterns according to a specific context. The applications of this type of tool range from forensic linguistics to improving the grammar tools in any word processor.

The software platform has been built but it is currently processing data and building its “knowledge base” by reading large quantities of corpora. When the parsing is complete, extensive testing will be performed to determine how well the currently established algorithms work.


Nov/09

20

TransluSense Blogging

Welcome to the TransluSense blog. I’ve created this blog to publicly share status updates on my R&D regarding TransluSense and related material. I’m a working professional, so I devote part of my hobby time to working on TransluSense. Hopefully I can get a significant amount of blogging done so that I can log what goes through my head.
