Text Mining - Working with Twitter Data

Social media presents one of the most interesting and timely data sources for crowd-sourced decision making and has been widely accepted as a data source for many applications. In this tutorial, we go through the most important steps of working with Twitter data.

Before we begin

In order to follow this tutorial interactively, you need to set up your computer a bit. You will need

  • a copy of jq in your path https://stedolan.github.io/jq/download/
  • preferably a bash shell and basic Unix tools (MinGW or the Git shell for Windows)
  • the data files sample-tweet.json and tweets.json downloaded into a folder on your computer
  • optional: QGIS if you have downloaded Twitter data and want to visualize it geographically
  • optional: python, pip, and tweepy if you want to stream Twitter data yourself

Mining Twitter Data

Preparing for API Access

Twitter provides a nice and clean API, and the first thing you will need is, well, a Twitter account. Then, as of July 2018, you must apply for a Twitter developer account and give some information on how you want to use the Twitter API, for example on Twitter Apps. Then, you need to create an app, which provides you with the credentials to use the API. As this process changes over time, just look it up on Twitter's own web pages.

To follow along with this tutorial, you will need a set of keys, namely

  • The Consumer Key (API Key)
  • The associated Consumer Secret (API Secret)
  • An Access Token
  • An associated Access Token Secret

Each of these is a long alpha-numeric string and for simplicity, we will store them in an environment file called secret.env of the following format:

#Access Token
TWITTER_KEY=274[...]M9b
#Access Token Secret
TWITTER_SECRET=WKS[...]1oI
#Consumer Key (API Key)
TWITTER_APP_KEY=8Co[...]Plt
#Consumer Secret (API Secret)
TWITTER_APP_SECRET=cEI[...]net

Later, we will discuss why this is a good way of dealing with these values…

Accessing the Streaming API via Python

There is an awesome, small Python library tweepy which helps you with accessing the API.

The basic structure of tweepy is that you implement a Python class whose methods are called whenever new tweets become available through the API.

The first step is to configure authentication, to authenticate, to register the class that shall handle tweets, and to start streaming (for a certain bounding box, for example).

In the following snippet, the class StreamListener is a class we implement ourselves.

import os
import tweepy

auth = tweepy.OAuthHandler(os.environ['TWITTER_APP_KEY'], os.environ['TWITTER_APP_SECRET'])
auth.set_access_token(os.environ['TWITTER_KEY'], os.environ['TWITTER_SECRET'])
api = tweepy.API(auth)
stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)

stream.filter(locations=[-180.0,-90.0,180.0,90.0])

The first two lines present the authentication data to the system; we use the environment to transport these values. While this is not super-secure, it is very convenient together with the file above and with Docker, which we will be using to handle errors. The following lines register an instance of our self-defined class StreamListener with the API and, finally, the call to stream.filter starts streaming tweets into our instance stream_listener of our class StreamListener.

On a typical Linux machine, you can first load the environment file and then just run the Python file containing this snippet. Note that plain KEY=VALUE assignments as in secret.env are only visible to child processes such as Python if they are exported, so we switch on auto-export while sourcing:

$> set -a; source secret.env; set +a
$> python main.py

But now, where do we actually get the tweets from? Well, we need to implement the class StreamListener. The following gives an example:

import json
import tweepy

class StreamListener(tweepy.StreamListener):
    def __init__(self):
        # append each tweet as one JSON object per line
        self.outfile = open('tweets.json', "a+")
        tweepy.StreamListener.__init__(self)

    def on_status(self, status):
        # status._json holds the raw JSON of the tweet as delivered by the API
        tweet = json.dumps(status._json)
        print(tweet, file=self.outfile)

    def on_error(self, status_code):
        # ... add proper error handling (like throwing an uncaught exception ;-) ...
        raise RuntimeError("Twitter API returned error %d" % status_code)

This completes a minimal working example of using tweepy; a more detailed explanation can be found in the tweepy documentation.

Wrap-Up and Assignment

Okay, this completes the first major step towards mining Twitter data. If you combine the things you need to do by hand and the Python snippets, you will end up with a system that is able to follow the public Twitter API stream and write each tweet as a JSON object into a file. That is, the file contains one JSON object after another, each JSON object describing one tweet.

In fact, you will have created three files (after running the example):

  • secret.env containing your API key information
  • main.py a streaming API client based on tweepy
  • tweets.json the result of mining around a little bit.

Advanced: Expecting Errors is Better than Handling Errors

Mining tweets over a long time is tedious, because we have to deal with errors like API problems, network faults, a full disk, etc. In general, it is a good strategy to enumerate all errors and to think about handling each of them. However, this is rather impossible for a non-standard API, a quickly evolving open-source library, and the Internet with all of its complexity. The second-best strategy is to fail fast. In fact, we don't care about errors except for the fact that they should quickly stop the running application (e.g., throw on everything you know about, don't catch exceptions (we are not Java kids, are we?), and exceptions you were not thinking of, like a full disk, will end the program by themselves). In short, we want every small error to quickly exit the program. And then, we start over from scratch, don't we?

A bash solution to this would look like this:

while true; do
  python main.py
  sleep 1s;
done

Though it is very simple and stable, it has two downsides: first, more than one second of downtime for each error might be too much and, second, permanent errors lead to permanent invocations, possibly hitting rate limits on the remote API.

Docker provides support for restarting containers with many options; it allows you to have almost zero downtime and will, for example, gradually increase the delay between restarts. The most interesting fact is that, as the Docker daemon starts on system boot, even when you reboot your computer (like pressing the reset button), it will quite quickly (during the boot process) launch your Twitter container again.
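
A minimal sketch of this approach (the image name twitter-streamer and its Dockerfile are placeholders you would have to provide): package main.py together with tweepy into an image, pass the credentials in via the env file, and let Docker restart the container whenever it exits.

# build an image containing main.py and the tweepy dependency (Dockerfile not shown)
docker build -t twitter-streamer .
# run detached, with the credentials from secret.env and an automatic restart policy
docker run -d --env-file secret.env --restart unless-stopped twitter-streamer

With the unless-stopped policy, Docker restarts the container after every crash (with an increasing delay between attempts) and also after a reboot of the machine, unless you explicitly stopped it.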

Another option is to go for advanced process managers like systemd or supervisor, which take care that your Python script is running almost always and, especially, can inform you should it not be working.

Twitter Data Objects

The Twitter API provides a set of fields for each tweet; the following image of a rendered tweet was copied from the current API documentation and annotated.

Basically, each tweet consists of text, multimedia, hashtags, metadata (like the number of retweets or the date of sending), and information about the user sending the tweet. What does this actually look like for this very tweet? Well, here it is:

{
  "contributors": null,
  "truncated": true,
  "text": "The Shortest Paths Dataset used for #acm #sigspatial #gis cup has just been released. https://t.co/pzeEleBfu9 #gis… https://t.co/IF7z1WnUDk",
  "is_quote_status": false,
  "in_reply_to_status_id": null,
  "id": 1062405858712272900,
  "favorite_count": 3,
  "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
  "retweeted": false,
  "coordinates": null,
  "entities": {
    "symbols": [],
    "user_mentions": [],
    "hashtags": [
      {
        "indices": [
          36,
          40
        ],
        "text": "acm"
      },
      {
        "indices": [
          41,
          52
        ],
        "text": "sigspatial"
      },
      {
        "indices": [
          53,
          57
        ],
        "text": "gis"
      },
      {
        "indices": [
          110,
          114
        ],
        "text": "gis"
      }
    ],
    "urls": [
      {
        "url": "https://t.co/pzeEleBfu9",
        "indices": [
          86,
          109
        ],
        "expanded_url": "http://www.martinwerner.de/datasets/san-francisco-shortest-path.html",
        "display_url": "martinwerner.de/datasets/san-f…"
      },
      {
        "url": "https://t.co/IF7z1WnUDk",
        "indices": [
          116,
          139
        ],
        "expanded_url": "https://twitter.com/i/web/status/1062405858712272898",
        "display_url": "twitter.com/i/web/status/1…"
      }
    ]
  },
  "in_reply_to_screen_name": null,
  "in_reply_to_user_id": null,
  "retweet_count": 0,
  "id_str": "1062405858712272898",
  "favorited": false,
  "user": {
    "follow_request_sent": false,
    "has_extended_profile": false,
    "profile_use_background_image": true,
    "default_profile_image": false,
    "id": 2744619733,
    "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",
    "verified": false,
    "translator_type": "none",
    "profile_text_color": "333333",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/738828320943525888/sx-cu2LT_normal.jpg",
    "profile_sidebar_fill_color": "DDEEF6",
    "entities": {
      "url": {
        "urls": [
          {
            "url": "https://t.co/74ySSExk6l",
            "indices": [
              0,
              23
            ],
            "expanded_url": "http://www.martinwerner.de",
            "display_url": "martinwerner.de"
          }
        ]
      },
      "description": {
        "urls": []
      }
    },
    "followers_count": 42,
    "profile_sidebar_border_color": "C0DEED",
    "id_str": "2744619733",
    "profile_background_color": "C0DEED",
    "listed_count": 7,
    "is_translation_enabled": false,
    "utc_offset": null,
    "statuses_count": 116,
    "description": "",
    "friends_count": 54,
    "location": "",
    "profile_link_color": "1DA1F2",
    "profile_image_url": "http://pbs.twimg.com/profile_images/738828320943525888/sx-cu2LT_normal.jpg",
    "following": false,
    "geo_enabled": false,
    "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",
    "screen_name": "trajcomp",
    "lang": "de",
    "profile_background_tile": false,
    "favourites_count": 76,
    "name": "Martin Werner",
    "notifications": false,
    "url": "https://t.co/74ySSExk6l",
    "created_at": "Tue Aug 19 10:55:32 +0000 2014",
    "contributors_enabled": false,
    "time_zone": null,
    "protected": false,
    "default_profile": true,
    "is_translator": false
  },
  "geo": null,
  "in_reply_to_user_id_str": null,
  "possibly_sensitive": false,
  "lang": "en",
  "created_at": "Tue Nov 13 18:04:29 +0000 2018",
  "in_reply_to_status_id_str": null,
  "place": null
}

We can now go through each and every aspect of this tweet, but we keep it rather coarse for now, as we will discuss aspects of those fields anyway when learning how to handle, filter, and organize many such tweets. But some things are worth realizing:

  • each tweet has a unique 64 bit unsigned ID given as an integer (field id) and as a string (field id_str)
  • each tweet has a timestamp created_at and though I created this tweet in Germany (GMT+1), it is stored in UTC (GMT+0) time zone. All tweets share this timezone. In this way, it is very easy to relate tweets to each other on a global scale, but more difficult to relate a tweet to the local time of day.
  • the language is estimated by Twitter
  • the whole user account is embedded into the tweet. This is highly redundant, but very useful for web performance: A tweet object is designed to be sufficient to render the tweet with Javascript (e.g., create the view shown above).
  • hashtags are isolated
  • a field truncated has been introduced for compatibility: when Twitter moved away from the short 140-character tweets to longer tweets, they made all APIs return a truncated version of each tweet that is short enough for the old API guarantee. The field truncated tells us whether this happened. In addition, depending on which API options the client used, the tweet may contain an additional field full_text with the complete text.

Working with Twitter Data (or JSON in general)

JSON stands for JavaScript Object Notation and has become one of the central data representations on the Internet. It is extensible, human-readable, and easy to write. It can be read by all major programming languages and has been around for a long time in the context of RESTful services.

However, its great flexibility and human readability made it quite difficult to work with this type of data without programming. Nowadays, this has become possible thanks to a team of enthusiasts building a small program jq, which is also known as the sed for JSON data. With it, you can slice, filter, map, reduce, transform, extract, visualize, compress, and generate JSON objects.

The only downside of JQ is that it had to be so powerful and therefore looks complex at first glance. This is why we will learn it example by example.

The typical representation of JSON for JQ is as a sequence of JSON objects in a file. By sequence, we mean that each JSON object follows the next, possibly separated by some whitespace characters (spaces, tabs, newlines, whatever). However, jq also proposes and supports a slightly more structured format in which sets of JSON objects are represented as a text file in which each line contains exactly one JSON object.

Of course, we will adopt this idea, because then we can, for example, count tweets using wc -l or we can split very large files into smaller ones for processing using the standard split --lines=2000 program. In other words, it is then compatible with all the usual smart command line tools of Linux.

Following JQ naming, this is called compactified JSON.
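
For example (a small sketch, assuming you have a tweets.json as produced above), the -c switch of jq emits exactly this compact one-object-per-line form, which then plays nicely with the standard line-oriented tools:

jq -c . tweets.json > tweets-compact.json
wc -l tweets-compact.json                  # number of tweets
split --lines=2000 tweets-compact.json     # chop into files of 2000 tweets each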

In general, JQ is just a small command line utility jq which is intended to be used on pipes or on files just like grep or awk.

First steps with JQ

The first thing to know about jq is that it can be used to show JSON files with pretty-printing. In general, the first argument is always a query (we will learn about those later); further arguments may be files. However, data can also be read from standard input. The trivial query that does not change the JSON object at all is represented by a dot (.) and is used in the following examples:

jq . tweets.json
cat tweets.json | jq .

Now it is already time to learn some strange things about Linux. When you use the previous two statements, you see a coloured and pretty-printed representation of the tweet. However, it is usually too long to fit into your terminal window. Therefore, you might be tempted to use the less pager like this

cat tweets.json | jq . | less

Now, jq is smart enough to see that it does not write to a terminal session (it writes to another program, namely less). Therefore, it does not colorize anymore (as colors are additional so-called escape characters and might lead to misbehaviour of the follow-up program). But there is a switch to force coloring, and using it leads to the following:

$> cat tweets.json | jq -C . | less
ESC[1;39m{
  ESC[0mESC[34;1m"contributors"ESC[0mESC[1;39m: ESC[0mESC[1;30mnullESC[0mESC[1;39m,
  ESC[0mESC[34;1m"truncated"ESC[0mESC[1;39m: ESC[0mESC[0;39mfalseESC[0mESC[1;39m,
  ESC[0mESC[34;1m"text"ESC[0mESC[1;39m: ESC[0mESC[0;32m"And it has been an amazing experience, again... https://t.co/0IKyTlhQOW"ESC[0mESC[1;39m,
  ESC[0mESC[34;1m"is_quote_status"ESC[0mESC[1;39m: ESC[0mESC[0;39mtrueESC[0mESC[1;39m,

Not quite what we want. less usually does not interpret escape characters, as this has been a security issue for a long time. But if we now force less to interpret these escape sequences (it has the -R switch for exactly this), we end up with what we wanted:
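
$> cat tweets.json | jq -C . | less -R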

We can use the up and down keys to navigate a coloured and pretty-printed version of the tweet, and the key q gets us out of this view.

First steps in JQ

Value Expression

As already said: jq provides a flexible language to transform tweets. Let us start with the smallest entity of this cool language: an expression. The most basic expressions in programming languages are values, and this is the same for the jq query language. In the following, we will pipe our input data to jq, because jq is supposed to have input data and will not work well without any. Note that the first sections here ignore the input completely.

icaml$ cat sample-tweet.json | jq true
true
icaml$ cat sample-tweet.json | jq false
false
icaml$ cat sample-tweet.json |jq 1.42
1.42
icaml$ cat sample-tweet.json | jq '"this is a string"'
"this is a string"

This illustrates that an expression evaluates to itself (e.g., true and false to a boolean value, 1.42 to a number, strings to strings). It is important to note that the JSON equivalents of these values are identical, with the only difference that strings get quotation marks. And as quoting is so important when using JQ in a shell (it is easy to get interference between shell quoting and JQ quoting), we basically always use single quotes '' to let those expressions pass unaltered through the bash shell. Inside them, we can freely use double quotation marks " to format for jq.

Creating Objects and Arrays

Basically, JSON supports two higher-order data types: JSON objects and JSON arrays. While JSON objects contain key-value pairs in which keys are always strings, JSON arrays are just ordered lists of values. Let's create some in JQ. This is done, again, by writing them as they are:

icaml$ cat sample-tweet.json | jq '{"key1":42,"key2":"a string"}'
{
  "key1": 42,
  "key2": "a string"
}
icaml$ cat sample-tweet.json |jq '[1,2,3,4]'
[
  1,
  2,
  3,
  4
]
icaml$ cat sample-tweet.json | jq '{"array":[1,2,4,8],"2d array":[[1,2],[3,4]],"nested objects":{"key":"value"}}'
{
  "array": [
    1,
    2,
    4,
    8
  ],
  "2d array": [
    [
      1,
      2
    ],
    [
      3,
      4
    ]
  ],
  "nested objects": {
    "key": "value"
  }
}

Extracting fields - the dot operator

One of the most frequent operations is to extract information from a JSON file. For example, we want to know the IDs we have in a file. Extraction can be done with the . operator as follows:

icaml$ cat tweets.json |jq '.id_str'
"1062406263932444672"
"1062405858712272898"
"1036898465270444032"
"1034516701235372032"
"1027811999529529344"
[...]

We can also chain this operator by applying it to the result. This means, if we use a dot expression to match an object, we can directly use another dot expression to match an element. Less theoretically, we can write

icaml$ cat sample-tweet.json |jq .user.url
"https://t.co/74ySSExk6l"
icaml$ cat sample-tweet.json |jq .user.entities.url.urls
[
  {
    "url": "https://t.co/74ySSExk6l",
    "indices": [
      0,
      23
    ],
    "expanded_url": "http://www.martinwerner.de",
    "display_url": "martinwerner.de"
  }
]
icaml$ 

That is, for objects, the . operator suffices to get any value from the hierarchy we might want. If we apply this over a file with more than one JSON object, we get one result per object (just like in the ID case above).

Extracting from arrays, the bracket operator

But now, how do we deal with arrays? There are essentially two ways: accessing the array with an integer or iterating over the array. In order to show the first one, we need an array whose individual elements we want to access. Let us create a simple 2x2 matrix with 1,2,3,4 in JSON and use JQ to get the entry at i,j:

icaml$ echo "[[1,2],[3,4]]" | jq '.'
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]
icaml$ echo "[[1,2],[3,4]]" | jq '.[0]'
[
  1,
  2
]
icaml$ echo "[[1,2],[3,4]]" | jq '.[0][1]'
1
icaml$ echo "[[1,2],[3,4]]" | jq '.[1][0]'
2
icaml$ echo "[[1,2],[3,4]]" | jq '.[1][1]'
3
icaml$ echo "[[1,2],[3,4]]" | jq '.[1][1]'
4
icaml$ 

The bracket operator can be used to access a specific element of an array (say the first, second, etc.). Counting starts at zero. If the bracket operator returns an array (as does .[0] in our case), brackets can be chained again leading to the nicely readable expressions like .[1][0].

And of course, brackets can be mixed with dots. Let us retrieve the URL from the profile of the sample tweet. For this, we have to access an array (there might be more than one URL). Let us take only the first one.

icaml$ cat sample-tweet.json |jq .user.entities.url.urls[0].display_url
"martinwerner.de"

Okay. This is cool, isn't it?

But sometimes, arrays are used to model things that are all equally important. For this, there is another bracket operator, which basically runs the remaining expression for each of the entries, usually creating one output item per entry. Let us do this for a real array:

icaml$ echo "[[1,2],[3,4],[5,6]]" | jq '.[][0]'
1
3
5
icaml$

This is good. But note that we have one JSON object in the input and three JSON objects in the output. In most real cases, you can help yourself by creating an object that contains identifying information, as follows. Let us take our sample tweet and extract all the hashtags.

icaml$ cat sample-tweet.json |jq '.entities.hashtags[].text'
"acm"
"sigspatial"
"gis"
"gis"

Now, you can run the same on many tweets:

icaml$ cat tweets.json | jq '.entities.hashtags[].text' 
"acm"
"sigspatial"
"gis"
"gis"
"MyData2018"
"SpatialComputing"
"GISChat"
"DataScience"
"tutorial"
"Spark"
"AWS"
"Docker"
"spatial"
"analytics"
"DataScience"

But, how do we now tell which one came from which tweet? One approach is to create output objects with two queries (queries can be completely independent). More concretely, consider

icaml$ cat sample-tweet.json | jq '{"id":.id_str, "hashtag": .entities.hashtags[].text}' 
{ 
  "id": "1062405858712272898",
  "hashtag": "acm"
}
{
  "id": "1062405858712272898",
  "hashtag": "sigspatial"
}
{
  "id": "1062405858712272898",
  "hashtag": "gis"
}
{
  "id": "1062405858712272898",
  "hashtag": "gis"
}

This uses two queries in one object construction (cool, eh), thereby creating an object with the id and the hashtag. When we run this over the large file, the id still tells us from which tweet we took each hashtag.

There is, however, an additional convenience function converting between objects and arrays of key-value pairs (themselves being objects). However, we need pipes to use it. So look ahead if you are impatient.

Calculating with JQ

JQ would not be as powerful as it is if it could not compute. And with computation it is rather smart. Let's start with simple things. Again, we have to feed a valid JSON object to start the machinery; the echo "[]" part is nothing more than that.

icaml$ echo "[]" | jq 1+2
3
icaml$ echo "[]" | jq '"hello " + "world!"'
"hello world!"
icaml$ echo "[]" | jq '[1,2]+[3]'
[
  1,
  2,
  3
]
icaml$ echo "[]" | jq '{"key":"value"}+{"key2":"value2"}'
{
  "key": "value",
  "key2": "value2"
}
icaml$ 

As you can see, addition has been defined for all types we have seen so far and - in general - all arithmetic operators in JQ do the most probable thing that you meant and can fail if you do things you should not do:

icaml$ echo "[]" | jq '{"key":"value"}+[1,2,3]'
jq: error (at <stdin>:1): object ({"key":"val...) and array ([1,2,3]) cannot be added
icaml$ 

But be careful, sometimes it is not at all obvious how the developer decided. For example, adding objects is possible even with duplicate keys. However, addition does not add the values of the keys (as you might think); instead, it takes the last value:

icaml$ echo "[]" | jq '{"key":"value"}+{"key":"value for duplicate key"}'
{
  "key": "value for duplicate key"
}

Brackets: Making complex expressions simple

The next complexity is that you can make multiple and complex expressions behave like a single one, and it works just as with numbers and brackets:

icaml$ echo "[]" | jq '"x"+"y"*2'
"xyy"
icaml$ echo "[]" | jq '("x"+"y")*2'
"xyxy"

In this example, you see that in the first case, the multiplication is only applied to the expression left of it, namely to y. If, however, we use brackets, this binds "x"+"y" into one expression and the operator operates on both.

Many expressions - The Comma Operator

Okay, now the comma operator is useful: if you have multiple expressions (e.g., queries), you can run them all at once by separating them with a comma. The outputs are concatenated. That is, when the first query generates k outputs and the second query generates l outputs, the comma-connected expression generates k+l results. For example:

icaml$ cat sample-tweet.json |jq '.id_str, .text'
"1062405858712272898"
"The Shortest Paths Dataset used for #acm #sigspatial #gis cup has just been released. https://t.co/pzeEleBfu9 #gis… https://t.co/IF7z1WnUDk"
icaml$ 
icaml$ echo '{"key":"value", "array":[1,2,3,4]}' | jq '.key,.array'
"value"
[
  1,
  2,
  3,
  4
]

The pipe operator - Streaming Queries

As already discussed before, JQ allows for having many independent queries. Usually queries depend on the original input. However, we might want queries that depend on the output of a filter, and this is exactly where the pipe operator comes into play. It basically runs the query left of the pipe and feeds the results into the right-hand side query of the pipe.

For example, we can first select the user subobject and then operate on it:

icaml$ cat sample-tweet.json |jq '.user | .name'
"Martin Werner"
icaml$ 

In this snippet, the output of the filter .user is the complete user object. Applying .name to it gives the name.

Some Functions

JQ would not be as powerful if it only allowed extracting and creating data. Instead, JQ contains a lot of functions and you can define some yourself. However, let us first start with some utility functions you don't want to miss on your first steps with JQ:

Length (of a string or array)

Returns the length of an object, string, or array.

icaml$ cat sample-tweet.json |jq '.user.name | length'
13
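
It also works on arrays; for example, we can count the hashtags of the sample tweet (the entities.hashtags array we saw above has four entries):

icaml$ cat sample-tweet.json |jq '.entities.hashtags | length'
4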

Keys

Returns the keys of an object (indices for arrays)

icaml$ cat sample-tweet.json |jq '. | keys[]'
"contributors"
"coordinates"
"created_at"
"entities"
"favorite_count"
"favorited"
"geo"
"id"
"id_str"
"in_reply_to_screen_name"
"in_reply_to_status_id"
"in_reply_to_status_id_str"
"in_reply_to_user_id"
"in_reply_to_user_id_str"
"is_quote_status"
"lang"
"place"
"possibly_sensitive"
"retweet_count"
"retweeted"
"source"
"text"
"truncated"
"user"
icaml$ 
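
And for arrays, keys indeed returns the indices; a quick example with a toy array:

icaml$ echo '[10,20,30]' | jq 'keys'
[
  0,
  1,
  2
]
icaml$ 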

Doing Map-Reduce in JQ

The MapReduce paradigm originates in functional programming and is very useful for stream-oriented programming as with JQ. It consists of two aspects: map, which basically applies a function (e.g., a filter) to each of the elements of a collection, and reduce, which can be used to reduce a collection to a single object (or fewer objects).
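
The map side is covered in detail right below. As a small taste of the reduce side, jq offers a reduce keyword (and the convenience builtin add) that folds all elements of a collection into one value, for example the sum of an array:

icaml$ echo [1,2,3,4] | jq 'reduce .[] as $x (0; . + $x)'
10
icaml$ echo [1,2,3,4] | jq 'add'
10
icaml$ 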

The Map function

Map is a very natural function in JQ. Look at the following canonical example:

icaml$ echo [1,2,3,4] | jq 'map(.+1)'
[
  2,
  3,
  4,
  5
]
icaml$ 

This just takes the array input and applies the operator inside the map function to each of the elements, forming a new array of the results. That is, each element of the result is the result of the filter applied to the corresponding element of the input.

In this way, we can calculate a few square numbers:


icaml$ echo [1,2,3,4] | jq 'map([.,.*.])'
[
  [
    1,
    1
  ],
  [
    2,
    4
  ],
  [
    3,
    9
  ],
  [
    4,
    16
  ]
]
icaml$ 

But what about objects? Well, there is another function map_values that does what you would expect on objects:

icaml$ echo '{"key":"value","key2":"value2"}' | jq 'map_values(.+"_")'
{
  "key": "value_",
  "key2": "value2_"
}
icaml$ 

Entries: to_entries, from_entries, with_entries

Sometimes, you might want to work not only on the values, but rather on the key value pairs. For example:

icaml$ echo '{"key":"value","key2":"value2"}' | jq 'to_entries'
[
  {
    "key": "key",
    "value": "value"
  },
  {
    "key": "key2",
    "value": "value2"
  }
]
icaml$ 

This allows you to do things like this:

icaml$  echo '{"key":"value","key2":"value2"}' | jq 'to_entries | map(.key+" is "+.value)'
[
  "key is value",
  "key2 is value2"
]
icaml$ 

There is a reverse function from_entries and a convenience function with_entries, which is a shorthand for the chain to_entries | map(...) | from_entries, with which you can operate on each entry of an object, transforming it from one key-value pair to another key-value pair.
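
A small sketch of with_entries (the input object here is just made up for illustration): each entry is presented as an object with the fields key and value, and the object we construct from it becomes the new entry:

icaml$ echo '{"followers":42,"friends":54}' | jq 'with_entries({"key": ("count_" + .key), "value": .value})'
{
  "count_followers": 42,
  "count_friends": 54
}
icaml$ 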

Type Conversions

We have already seen that some operations expect certain types. For example, JSON keys must be strings and the + operator does not concatenate a string and a number. In JQ, type conversions are functions and typically used together with the pipe | and brackets. For example:

icaml$  echo '{"summand1":42.1,"summand2":39}' | jq '(.summand1|tostring)+"+"+(.summand2|tostring)+"="+((.summand1+.summand2)|tostring)'
"42.1+39=81.1"
icaml$ 
icaml$ echo "1.234" | jq "tonumber"
1.234
icaml$ 

Automatic Formatting

Sometimes, it is useful to create other formats from JSON, most notably CSV variants. Some of the most important such output formats are defined and can be used directly, for example

  • @text: calls tostring
  • @html: escapes literals < and >
  • @csv: creates comma-separated values (CSV)
  • @tsv: creates tab-separated values (TSV)
  • @sh: does shell escaping such that the output is proper for giving it on a shell to another program
  • @base64: creates a base64 representation
  • @base64d: decodes base64 representation assuming UTF8 strings
  • @uri: encodes for use in URLs

More concretely:

icaml$ echo '{"search":"where is waldo?"}' | jq '"https://www.google.com/search?q="+(.search|@uri)'
"https://www.google.com/search?q=where%20is%20waldo%3F"
icaml$ 

or even (extending the example from above computing square numbers):

icaml$ echo [1,2,3,4] | jq 'map([.,.*.]) | .[] | @csv'
"1,1"
"2,4"
"3,9"
"4,16"

Raw Output

It is important to note that jq always outputs JSON. That is, like in the previous examples, even collections of strings are getting escaped as JSON strings. Especially with @csv, this is not what you typically want. Luckily, there is a command line flag (-r, for raw output) telling JQ that it can output raw results (without JSON encoding). Then, the previous example becomes:

icaml$ echo [1,2,3,4] | jq -r 'map([.,.*.]) | .[] | @csv'
1,1
2,4
3,9
4,16

This can readily be used to create a nice CSV file using shell pipes.
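
For example (a sketch; pick whatever fields you are interested in), the following turns the mined data into a CSV file with one line per tweet:

cat tweets.json | jq -r '[.id_str, .created_at, .lang] | @csv' > tweets.csv

Since @csv quotes the string fields properly, the resulting tweets.csv can be opened directly in a spreadsheet or processed with further command line tools.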

Author: Artem Leichter
Last modified: 2019-01-14