tag graph

I made a network graph of the tags on all of my blog posts, using vis.js. I thought there might be some interesting clustering that showed up, but... not really. And it's incredibly slow. Can any of you improve it?

Tags are unidirectionally (originally bidirectionally) linked if they have appeared in the same post together. Link weight is the number of posts in which both tags appear. Size and mass are the total number of occurrences of the tag.
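
A minimal sketch of that bookkeeping, assuming each post is reduced to an array of tag names. The `posts` sample and the `buildTagGraph` name are invented for illustration and are not the code that generated the graph; the output uses the `id`, `label`, `value`, `mass`, `from` and `to` fields that vis.js nodes and edges accept.

```javascript
// Hypothetical input: one array of tag names per post.
const posts = [
  ["mpegs", "scene missing"],
  ["mpegs", "music"],
  ["mpegs", "scene missing", "music"],
];

function buildTagGraph(posts) {
  const counts = new Map();  // tag -> number of posts using it
  const weights = new Map(); // JSON of [tagA, tagB] -> number of posts with both
  for (const tags of posts) {
    // De-duplicate and sort so each pair is always seen in the same order.
    const unique = [...new Set(tags)].sort();
    for (const t of unique) counts.set(t, (counts.get(t) || 0) + 1);
    for (let i = 0; i < unique.length; i++) {
      for (let j = i + 1; j < unique.length; j++) {
        const key = JSON.stringify([unique[i], unique[j]]);
        weights.set(key, (weights.get(key) || 0) + 1);
      }
    }
  }
  // Size and mass are both the tag's total number of occurrences.
  const nodes = [...counts].map(([id, n]) => ({ id, label: id, value: n, mass: n }));
  // One edge per unordered tag pair; the weight is the co-occurrence count.
  const edges = [...weights].map(([key, value]) => {
    const [from, to] = JSON.parse(key);
    return { from, to, value };
  });
  return { nodes, edges };
}

const graph = buildTagGraph(posts);
```

Emitting one edge per unordered pair gives half as many edges as emitting a link in each direction.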

26 Responses:

  1. I thought you meant it was slow to produce, but yeah, slow to display. Can't say exactly how slow since I backed out after a couple minutes. Too bad, I was hoping to examine the tentacles node.

  2. Here's my take. Runs way faster than the vis.js version, looks pretty, but dunno how useful it is. There's just too much cross-linking between topics for meaningful clusters to form, I think. https://baicoianu.com/~bai/janusweb/test/mindmap-jwz.html

    • jwz says:

      My over six thousand ad blockers apparently, even when given the thumbs up, won't let that do anything but play for me some smooth jazz.

      • James Baicoianu says:

        Weird. Definitely no ads on the page. Should run in any vaguely-modern browser. Are you seeing anything at all?

        I'll try popping it on a proper host somewhere tomorrow rather than just dumping it in public_html, we'll see if that helps.

        • jwz says:

          I did not try other browsers yet, but basically after I click "allow" on several things and still nothing works, I just assume that some or all of whatever thousands of mapreduced libraries are involved require both my blood type and an entire kidney before I can receive candy, and then I just stop trying.

          "Accept our cookie policy or we just can't guarantee your grandmother's safety."

          The smooth jazz was very soothing, though.

      • Dan says:

        I also run with aggressive ad blocking, and the site opened up immediately for me on Chrome/Windows.

  3. If you’re looking for clusters in all the blog posts you have, there might be different approaches to take. Might do topic modeling on the actual blog posts, aka machine tagging, then assign a score to each post based on how it aligns to those topics, and then run a clustering algo (K-means etc) on those scores as a vector. That will get you somewhere interesting.

    Permission to scrape blog posts? Won’t have time to do until work week, but I have some existing pythons that do similar to the above and I’d be curious to see how it performs.
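
    For the clustering step only, a rough sketch of plain k-means, assuming the topic model has already produced one score vector per post. The `scores` values, the function names, and the choice of starting centroids are all made up for illustration; the topic modeling itself is not shown.

```javascript
// Hypothetical per-post topic scores; a real run would get these from a topic model.
const scores = [
  [0.9, 0.1], [0.8, 0.2], [0.85, 0.05],
  [0.1, 0.9], [0.2, 0.8], [0.05, 0.95],
];

// Squared Euclidean distance between two vectors of equal length.
function dist2(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

// Index of the centroid closest to point p.
function nearest(p, centroids) {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
  }
  return best;
}

// Plain k-means (Lloyd's algorithm). Initial centroids are supplied by the
// caller so that the result is reproducible.
function kmeans(points, centroids, maxIter = 100) {
  let labels = [];
  for (let iter = 0; iter < maxIter; iter++) {
    const next = points.map(p => nearest(p, centroids));
    if (next.every((l, i) => l === labels[i])) break; // assignments are stable
    labels = next;
    centroids = centroids.map((old, c) => {
      const members = points.filter((_, i) => labels[i] === c);
      if (members.length === 0) return old; // leave an empty cluster where it was
      return old.map((_, d) => members.reduce((s, p) => s + p[d], 0) / members.length);
    });
  }
  return { labels, centroids };
}

const result = kmeans(scores, [scores[0], scores[3]]);
```

    `result.labels` holds one cluster index per post, in the same order as `scores`.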

  4. Daniel Abel says:

    Oh, a topic I can actually help you with!

    You ran into the classic problem of "most real-world networks look like a hairball at first glance". Teasing out meaningful structure is often quite complicated.

    The most commonly used approach is to threshold the network: since the obvious aspect of the hairball is that there are many edges, simply lowering the number of edges should help. In your case, you appear to have 97 nodes (tags) and 2478 edges. (A technical note: an additional cause of the slow display of your visualization might be that all edges are doubled. Since you compute them from the co-occurrence of the tags, you should get an undirected network, but you use bidirectional links, effectively doubling the number of links.)

    So, if you throw out the weak links, and only keep the strongest (and thus most significant) ones, the network should look much better. Using only the 99 strongest links (those with a weight larger than 125), I get the following:

    However, this network ends up having only 41 nodes, i.e. this thresholding loses a lot of nodes: a node without at least one strong edge has no edges left and is dropped. (Keeping such nodes as isolated nodes is also an option, but not much better.)
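
    A minimal sketch of this global thresholding, assuming an undirected edge list with one entry per tag pair. The sample weights and the `thresholdGlobal` name are hypothetical; only the cutoff of 125 comes from the numbers above.

```javascript
// Hypothetical sample data in vis.js shape (one edge per tag pair).
const sampleNodes = [
  { id: "mpegs" }, { id: "music" }, { id: "scene missing" }, { id: "teeth" },
];
const sampleEdges = [
  { from: "mpegs", to: "music", value: 200 },
  { from: "mpegs", to: "scene missing", value: 130 },
  { from: "music", to: "teeth", value: 10 },
];

// Keep only edges heavier than minWeight, then drop every node that has
// no surviving edge.
function thresholdGlobal(nodes, edges, minWeight) {
  const kept = edges.filter(e => e.value > minWeight);
  const connected = new Set(kept.flatMap(e => [e.from, e.to]));
  return { nodes: nodes.filter(n => connected.has(n.id)), edges: kept };
}

const strong = thresholdGlobal(sampleNodes, sampleEdges, 125);
```

    In this sample "teeth" disappears, because its only edge is below the cutoff.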

    One option is to do the thresholding "locally": simply keep, say, the 2 strongest edges for each node. This is somewhat questionable from a theoretical viewpoint, but often gives pretty good results. For your network, the result looks like this:

    This keeps all nodes, at the cost of making the meaning of the network somewhat strange (since it is no longer the "strongest edges").
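
    A sketch of the local variant, under the assumption that an edge survives if it is among the k strongest edges of either of its endpoints. The function name and sample data are hypothetical.

```javascript
// Hypothetical sample: a hub tag plus one weak link between two spokes.
const localSample = [
  { from: "hub", to: "a", value: 5 },
  { from: "hub", to: "b", value: 4 },
  { from: "hub", to: "c", value: 3 },
  { from: "hub", to: "d", value: 2 },
  { from: "a", to: "b", value: 1 },
];

// Keep an edge if it ranks among the k strongest edges of at least one endpoint.
function thresholdLocal(edges, k) {
  const byNode = new Map(); // node id -> edges touching that node
  for (const e of edges) {
    for (const id of [e.from, e.to]) {
      if (!byNode.has(id)) byNode.set(id, []);
      byNode.get(id).push(e);
    }
  }
  const kept = new Set();
  for (const list of byNode.values()) {
    list.sort((x, y) => y.value - x.value); // strongest first
    list.slice(0, k).forEach(e => kept.add(e));
  }
  return edges.filter(e => kept.has(e)); // preserve the input order
}

const localEdges = thresholdLocal(localSample, 1);
```

    Note that a well-connected node can end up with more than k edges, since each neighbor may nominate its own link to it. Here the hub keeps all four edges even with k = 1, and only the weak a/b link is dropped.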

    However, there is another question that comes up: the meaning of the weights. If I understand correctly, the edge weights are simply the number of blog posts on which the two tags appeared together. This will distort the network, since common tags will end up having stronger connections, simply due to appearing frequently. A more reasonable edge weight definition might correct for the individual popularity of the tags: instead of the "number of common blog posts", you might want to calculate something like "how much more likely these two tags are to appear together than if the tags were placed randomly". Doing this calculation changes the edge weights, resulting in (using a local thresholding, as before) something like this:

    Note that the actual threshold values used (125 for the first network, keeping 2 edges for each node for the other two) are, of course, parameters to be tuned, as is the placement of the nodes. (Automatic layout algorithms often give less-than-optimal results. For small networks such as this one, manual fine-tuning of the node placement is somewhat reasonable, although this also depends on what visualization tool you use, since some only do automatic placement. For fiddling with network visualizations, the open source program Cytoscape is quite good, and it is what I used for these images, although I did the thresholding and computation in Python.)
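
    One way to make the popularity correction described above concrete is the ratio of observed to expected co-occurrences, sometimes called lift: n_ab * N / (n_a * n_b), where N is the total number of posts. This is an assumption about a reasonable formula, not necessarily the one used for the image; the function name and the numbers are hypothetical.

```javascript
// Hypothetical counts: total posts, posts per tag, and raw co-occurrence weights.
const totalPosts = 100;
const tagCounts = { "mpegs": 50, "music": 40, "scene missing": 5 };
const rawEdges = [
  { from: "mpegs", to: "music", value: 20 },
  { from: "mpegs", to: "scene missing", value: 5 },
];

// Replace each raw co-occurrence count with observed / expected, where
// expected = n_a * n_b / N if tags were placed on posts independently.
function liftWeights(edges, counts, n) {
  return edges.map(e => ({
    ...e,
    value: (e.value * n) / (counts[e.from] * counts[e.to]),
  }));
}

const corrected = liftWeights(rawEdges, tagCounts, totalPosts);
```

    A value of 1 means the pair co-occurs about as often as chance would predict; values above 1 mean more often. In the sample, the rare tag ends up with the stronger link even though its raw count is lower.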

    • tb says:

      Yes, I also noticed that all the edges are duplicated, and tried to threshold the graph by changing the code like this:

      var cutoff = 10;
      // Keep one edge per tag pair (from < to) and drop links weaker than the cutoff.
      function checkValue(v) { return v.from < v.to && v.value >= cutoff; }
      var data = {
        nodes: nodes,
        edges: edges.filter(checkValue)
      };

      Interesting values of cutoff were somewhere between 10 and 100, and the reduced number of edges also made the speed tolerable.

    • Ooh, thank you for writing this up. Bookmarking this for next time I'm having trouble getting a graph to look like something other than a hairball. ❤

  5. Zach says:

    Visualizing the adjacency matrix may be easier. You can reorder the rows/columns based on similarity to see how everything clusters.
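
    A rough sketch of one simple reordering heuristic: start from one tag and repeatedly append the remaining tag whose row is most cosine-similar to the last one placed. This is only one of many possible orderings (hierarchical clustering is a common alternative). The matrix is hypothetical, with each tag's own count placed on the diagonal as an assumption, and the function names are invented.

```javascript
// Hypothetical symmetric co-occurrence matrix; the diagonal holds each tag's own count.
const tagNames = ["art", "mpegs", "corporations", "music"];
const matrix = [
  [10, 1, 9, 0],
  [1, 10, 0, 8],
  [9, 0, 10, 1],
  [0, 8, 1, 10],
];

// Cosine similarity between two rows (0 if either row is all zeros).
function cosine(u, v) {
  let dot = 0, nu = 0, nv = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    nu += u[i] * u[i];
    nv += v[i] * v[i];
  }
  return nu && nv ? dot / Math.sqrt(nu * nv) : 0;
}

// Greedy ordering: always append the unplaced row most similar to the last placed row.
function greedyOrder(m) {
  const remaining = new Set(m.map((_, i) => i));
  const order = [0];
  remaining.delete(0);
  while (remaining.size > 0) {
    const last = m[order[order.length - 1]];
    let best = -1, bestSim = -Infinity;
    for (const i of remaining) {
      const sim = cosine(last, m[i]);
      if (sim > bestSim) { best = i; bestSim = sim; }
    }
    order.push(best);
    remaining.delete(best);
  }
  return order;
}

// Apply the same permutation to rows and columns.
function reorder(m, order) {
  return order.map(i => order.map(j => m[i][j]));
}

const order = greedyOrder(matrix);
const sorted = reorder(matrix, order);
const sortedNames = order.map(i => tagNames[i]);
```

    After reordering, tags that co-occur heavily sit in adjacent rows and columns, so clusters show up as dense blocks along the diagonal.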

  6. jwz says:

    I changed it to omit tags that were used fewer than 10 times, and to omit inter-tag-pairing links that occurred fewer than 15 times. I also switched the layout algorithm from "barnesHut" to "repulsion". It's a little easier to read and performs better, but I was still hoping to see better clustering. E.g., the tag "scene missing" occurs exclusively on posts that also have the tag "mpegs", so ideally those would be right next to each other.

    If you make a copy of it and set "configure: enabled" to true you get a bunch of sliders to tweak the settings.
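
    Roughly, the pruning and settings described above amount to something like this sketch. The function name, constant names, and sample data are made up; only the two thresholds, the "repulsion" solver, and the configure setting come from the description.

```javascript
const MIN_TAG_USES = 10;  // omit tags used fewer than 10 times
const MIN_LINK_USES = 15; // omit tag pairs that co-occur fewer than 15 times

// Hypothetical sample data in vis.js shape.
const allNodes = [
  { id: "mpegs", value: 50 },
  { id: "scene missing", value: 20 },
  { id: "obscure", value: 3 },
];
const allEdges = [
  { from: "mpegs", to: "scene missing", value: 20 },
  { from: "mpegs", to: "obscure", value: 3 },
];

// Drop rare tags and weak links; also drop any link whose endpoint was removed,
// so no edge refers to a missing node.
function prune(nodes, edges) {
  const keptNodes = nodes.filter(n => n.value >= MIN_TAG_USES);
  const ids = new Set(keptNodes.map(n => n.id));
  const keptEdges = edges.filter(
    e => e.value >= MIN_LINK_USES && ids.has(e.from) && ids.has(e.to)
  );
  return { nodes: keptNodes, edges: keptEdges };
}

const pruned = prune(allNodes, allEdges);

// Options object for new vis.Network(container, data, options).
const options = {
  physics: { solver: "repulsion" }, // instead of the default "barnesHut"
  configure: { enabled: true },     // shows the sliders for tweaking settings
};
```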

    • Jim says:

      You have more interconnectedness than allows for meaningful clustering, at least by 3d-force-graph's nodeAutoColorBy(), whatever that is. codesandbox.io/s/magical-grass-pnf8l

      Sorry I couldn't figure out the tags you took out which are needed to render, so I replaced them with "?" -- probably better to take them out along with all the links that reference them. The arc width is the log of the link value. There are a ton of fancy rendering options I didn't try for curved arcs etc. I tried to use the Viridis colormap and font size for the node values, but I'm not sure that's working or well-considered.

        • keith says:

          BTW, I took you at your word that these links had direction, but I think your graph is "undirected." If this page were a "firstperson" page that links to "meta" and "www" then you have direction, but if this page is tagged equally with the "firstperson," "meta," and "www" tags then there isn't any direction. Some algorithms care, others do not. I'm going to repost taking out the directionality.

        • keith says:

          You changed the data in the original post. boo. You took out nodes but you didn't take out the links for those nodes. It's best to leave all the data in and then filter out the weak links based on low centrality or pagerank scores. Just because a node doesn't have a lot of links to/from it doesn't mean it's not important. Any chance of getting the original data set? Or better, a live data set?

          • jwz says:

            Ok, I pointed all the links forward, and the pruning happens in JS now, so you have all the raw data.

            • keith says:

              Thanks - republished with data as of this morning.

              • Alex says:

                As so often with network analysis, the metrics are more interesting than the graph. "conspiracies", "the future", "art", and "corporations" are all much more central (top 10 for eigenvector centrality) than they are pageranked (between 10th and 17th), implying they are the kind of high-multiplier actors this stuff is meant to identify. jwz is an artwork about corporate conspiracies and the future, which is... right? Also, the most important links are the ones between "doomed" and "tron", "firstperson" and "katrina", and "movies" and "diebold", which also seems intuitively correct.
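
                For anyone who wants to reproduce that kind of ranking, here is a minimal power-iteration sketch of eigenvector centrality on an undirected, weighted edge list. The function name and the three-tag sample are invented for illustration; this is not the code behind the published table, and PageRank is not shown.

```javascript
// Hypothetical sample: "conspiracies" links to both other tags.
const sampleIds = ["art", "conspiracies", "corporations"];
const sampleLinks = [
  { from: "art", to: "conspiracies", value: 1 },
  { from: "conspiracies", to: "corporations", value: 1 },
];

// Eigenvector centrality by power iteration. Each round, a node's new score is
// its old score plus the weighted sum of its neighbors' scores (iterating with
// A + I rather than A, which avoids oscillation on bipartite graphs), followed
// by normalization to unit length.
function eigenvectorCentrality(ids, edges, iterations = 100) {
  let score = new Map(ids.map(id => [id, 1]));
  for (let it = 0; it < iterations; it++) {
    const next = new Map(score); // start from the old scores (the "+ I" part)
    for (const e of edges) {
      // Undirected: each edge contributes in both directions.
      next.set(e.from, next.get(e.from) + e.value * score.get(e.to));
      next.set(e.to, next.get(e.to) + e.value * score.get(e.from));
    }
    const norm = Math.sqrt([...next.values()].reduce((s, x) => s + x * x, 0)) || 1;
    for (const id of ids) next.set(id, next.get(id) / norm);
    score = next;
  }
  return score;
}

const centrality = eigenvectorCentrality(sampleIds, sampleLinks);
```

                In the sample, "conspiracies" gets the highest score because both other tags link to it, and the two end tags tie.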

                • keith says:

                  Including the data with the graph lets both sides of the brain grok the data in their own way. I'm disappointed "teeth" didn't earn a higher rank.

                  It'd be interesting to weight the links with the number of comments (or sum bytes of comments) for each post. I think that would "bring up" the tags that provoke reactions.

        • Jim says:

          I love your work, and especially the table rows. Do you think the rows could have pie charts and thumb sliders?

          • keith says:

            Thanks! I was just thinking it'd be cool to make the nodes little pie chart like things displaying the ratio of different communities it's connected with. Also highlighting the edges/nodes in the graph as you mouse over them in the table. Anyways hit me up in email for what you had in mind so we don't overstay our host's welcome. email is my github username @gmail.com.