What do you do with 36,409 places and 6,506 connections? Some cartographic representations of Pleiades data

Two projects that I am involved with, Pleiades and the World-Historical Gazetteer at the University of Pittsburgh, have been devoting considerable time and energy to modeling conceptual places and their connections, so I thought it was worth discussing a few of our observations and presenting some preliminary steps to visualize what we are doing.

First, a somewhat crowded overview of all of the Pleiades data set with map symbols representing different place types.

Figure 1: All Pleiades places

At this level of zoom the map is nearly incomprehensible, but it does reveal some interesting aspects of our data set. The grid-like structure in India and central Asia is the result of “dumping” places for which we have insufficient data into the middle of Barrington Atlas grid squares. For the editorial board such a view is actually quite useful, as it highlights where we need to clean our data and focus on creating better locations.

Another way to show the reach of the Pleiades project is through a choropleth map, which shades different countries according to the number of Pleiades places within them.

Figure 2: Choropleth Map

This is interesting, but I think it gives a fairly misleading sense of Pleiades coverage. From this map a reader would be unable to tell the extent of our data into Russia, China, and other countries where our locations are clustered around certain areas, not evenly spread throughout the country. It does highlight the areas where we have fairly extensive coverage, namely Italy, Greece, and Turkey.

To get around these issues, very often projects like ours use heat-maps to show both the concentration and extent of their data. I find this particular approach to be more aesthetically pleasing than simply throwing all of the points on the map, but due to the nature of a heat-map, I am still not convinced that it accurately depicts the extent of our coverage.

Figure 3: Heat Map

One of my issues with heat-maps is how the colors “bleed” into areas where there are not points. While this can be adjusted and refined by decreasing the radius around each point, if taken too far the heat-map will simply show isolated dots of color instead of the expected continuous whole.

One experiment that I have done is to try and combine heat maps with a Voronoi diagram. The basic idea behind this approach is that the GIS system creates a polygon around each point, and any spot within that polygon is closer to that particular point than any other known point. This helps Pleiades editors, as a “hotspot” in one polygon indicates that there are multiple places “stacked” on one another on the same point, which is a good indication that we are dealing with inaccurate data. Conversely, a “hotspot” that extends through multiple polygons is expected behavior, and signifies that there is a dense cluster of points that are in close proximity but nevertheless still are in distinct locations.
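The “stacked” places that show up as a hotspot inside a single Voronoi polygon can also be listed directly from the coordinates. The following is only a minimal sketch of that check, not the GIS workflow described above: the function name, the rounding precision, and the sample identifiers and coordinates are all invented for illustration.

```python
from collections import defaultdict


def find_stacked_places(places, precision=4):
    """Group place ids that share the same (rounded) coordinates.

    `places` is an iterable of (place_id, longitude, latitude) tuples.
    Only coordinates shared by more than one place are returned.
    """
    by_coord = defaultdict(list)
    for place_id, lon, lat in places:
        key = (round(lon, precision), round(lat, precision))
        by_coord[key].append(place_id)
    return {coord: ids for coord, ids in by_coord.items() if len(ids) > 1}


# Hypothetical sample: three places dumped on one grid-square center,
# plus two nearby but distinct locations.
sample = [
    ("a", 75.5, 25.5),
    ("b", 75.5, 25.5),
    ("c", 75.5, 25.5),
    ("d", 12.4839, 41.8933),
    ("e", 12.4853, 41.8925),
]
stacked = find_stacked_places(sample)
```

Here `stacked` contains only the shared coordinate and the three ids sitting on it; the two distinct nearby points are left out, which matches the "dense cluster" case described above.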

Figure 4: Detail of Voronoi Polygons and a Heat-Map

This is a very aesthetically pleasing map, but it is still difficult to quickly identify the correspondence between points, polygons, and the heat map. Using a hex-bin map (which is essentially a choropleth map with small hex shapes) styled like a heat map perhaps provides the cleanest and most comprehensible view of both our data coverage and density.

Figure 5: Hex-bin map with heat map coloration

Of all the representations mentioned here (and many tests which were far too incomprehensible to show), I believe this map offers by far the best combination of understandability, honesty, and presentation. It clearly shows the concentration of our data in the Mediterranean like a heat map, but does a far better job of showing the precise location of the data points. It also shows a far more honest depiction of the number of points per country and the actual location of those points, which is not the case with a choropleth map at a country scale.
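For readers curious about what a hex-bin map does under the hood, here is a rough sketch of the counting step, assuming pointy-top hexagons addressed by axial coordinates. It treats longitude and latitude as flat x/y values, which distorts cell sizes away from the equator; a GIS package would normally handle projection and drawing. The function names and sample points are hypothetical.

```python
import math
from collections import Counter


def hex_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon."""
    x, z = q, r
    y = -x - z
    rx, ry, rz = round(x), round(y), round(z)
    dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
    # Recompute the component with the largest rounding error so that
    # the cube coordinates still satisfy x + y + z == 0.
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz


def hex_bin(points, size):
    """Count points per hexagon; `size` is the center-to-corner distance."""
    counts = Counter()
    for x, y in points:
        q = (math.sqrt(3) / 3 * x - y / 3) / size
        r = (2 / 3 * y) / size
        counts[hex_round(q, r)] += 1
    return counts


# Invented sample points: two fall in the same cell, one in its neighbor.
sample_points = [(0.0, 0.0), (0.1, 0.1), (1.7, 0.0)]
bins = hex_bin(sample_points, size=1.0)
```

Each key in `bins` is one hexagon and each value is the number of points inside it, which is the figure a heat-map style color ramp would then be applied to.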

What these maps do not capture is the presence of connections in the Pleiades data set. As part of our evolving data modeling and best practices, we are now experimenting with a more robust system for expressing relationships between different places in our data set. These relationships could be political, geographic, or highly conceptual. One highly interesting product of this approach is that we can start thinking of the Pleiades gazetteer as a description of a network of places, not just as a list of their names and locations.

As a result, it is now possible to graph some of the relationships in our data. This is highly experimental and very incomplete, but I hope that by sharing our first steps in this direction we can generate some discussion on our approach.

The first thing that I did was to download the Pleiades data set, then extract the connections information, creating a spreadsheet that listed each connection as a source – target combination that social network analysis software would understand. Essentially any place that connected to another place was the source, while the place connected to was the target. This was then put into Gephi, where different “communities”, or places with denser connections to each other, are indicated by different colors.
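The source – target extraction can be sketched in a few lines. This is an illustration only: the record layout (an `id` plus a `connects_to` list) and the sample identifiers are assumptions made for the example, not the actual structure of the Pleiades export. The `Source` and `Target` column headers are the ones Gephi looks for when importing an edge table.

```python
import csv
import io


def connections_to_edges(places):
    """Turn place records into (source, target) pairs.

    Each record is assumed to look like
    {"id": "...", "connects_to": ["...", ...]}; both keys are hypothetical.
    """
    edges = []
    for place in places:
        for target in place.get("connects_to", []):
            edges.append((place["id"], target))
    return edges


def write_edge_csv(edges, handle):
    """Write an edge table with the Source/Target header row."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)


# Invented sample records standing in for real gazetteer entries.
sample_places = [
    {"id": "place-1", "connects_to": ["region-a"]},
    {"id": "place-2", "connects_to": ["region-a", "place-1"]},
    {"id": "region-a"},
]
edges = connections_to_edges(sample_places)
buffer = io.StringIO()
write_edge_csv(edges, buffer)
```

Replacing `buffer` with an open file handle would produce a .csv that can be loaded as an edge table.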

Figure 6: Detail of the Pleiades connections graph

The figure above is a detail of a portion of the resulting graph. You can see communities clustering around regions like Sicily and Sardinia, or around extremely important cities like Rome. The square on the outer reaches of the graph is simply a number of unconnected places that are pushed to the edges by the Gephi visualization software. While this is an interesting and somewhat compelling visualization, it is devoid of any geographic context. Luckily, Gephi has a plugin that places nodes (in our case the places) in a geographic location if there is data available. As we have location data for most of our places, we can use this plugin, which yields the result below.

Figure 7: Pleiades places as a geospatial network

Now we are getting somewhere! The broad outlines of the Mediterranean are visible, as are features like the Nile river and even the outline of India. However, this network is still not on a geographic map (the Gephi globe plugin does not exactly match the coordinate system used by the geography plugin, and also it is based on modern geography), so we are somewhat missing the larger spatial context. Unfortunately there is not an easy way to export the spatially enhanced network with Gephi’s statistics and colors – the .kml plugin does capture the color, but lumps all of the statistics into a single description tag.

After some experimentation with exporting, importing, and reexporting in Gephi and QGIS, I finally found a solution by importing the .kml exported from Gephi into QGIS and exporting that as a .csv file which can then be manipulated in OpenRefine to “extract” all of the information from the description field.  From there, the .csv file can be re-imported into QGIS, which results in the visualization below.

Figure 8: Pleiades spatial connections overview map

While the overview is somewhat crowded and messy, a closer view of Italy shows the power of this visualization.

Figure 9: Network around Rome without labels
Figure 10: Network around Rome with labels

These visualizations show the networks of connections within a spatial context, and are an intriguing way to approach entities like kingdoms, political entities, or other place groupings. We are already experimenting with placing regions and larger entities (like Sardinia and Sicily) as the “midpoint” between all of their constituent connections, which you can see displayed on the maps above.
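One simple way to place a region at the “midpoint” of its constituent places is to average their coordinates. The sketch below assumes exactly that (an arithmetic mean of longitude and latitude), which is reasonable for compact regions but is only one possible definition of a midpoint; the coordinates are invented.

```python
def midpoint(coords):
    """Return the mean (longitude, latitude) of a list of coordinate pairs."""
    lons = [lon for lon, lat in coords]
    lats = [lat for lon, lat in coords]
    return sum(lons) / len(lons), sum(lats) / len(lats)


# Hypothetical member places of a region.
members = [(8.0, 39.0), (9.0, 40.0), (10.0, 41.0)]
center = midpoint(members)
```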

However, I want to take this idea one step further and eliminate the representative point entirely from such places. To do so, I decided that a mono modal network, or a network of just one place type, would be an interesting way to represent these connections. In short, any place that connected to the place Sardinia would now connect directly to all of the other places that connected to Sardinia, and the place marker of Sardinia would be eliminated from the network entirely. This resulted in a very interesting visualization where the density of network connections almost resembles a polygon.
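The collapse described above, in which a region node is removed and its neighbors are linked to each other directly, can be sketched as follows. This is a minimal illustration with made-up node names, treating all connections as undirected; it is not the exact procedure used to produce the figure.

```python
from itertools import combinations


def collapse_hub(edges, hub):
    """Remove `hub` and connect all of its former neighbors pairwise.

    Edges are treated as undirected and returned as a set of sorted pairs.
    """
    neighbors = set()
    kept = set()
    for a, b in edges:
        if a == hub and b != hub:
            neighbors.add(b)
        elif b == hub and a != hub:
            neighbors.add(a)
        elif a != hub and b != hub:
            kept.add(tuple(sorted((a, b))))
    # Every pair of former neighbors now shares a direct edge.
    for a, b in combinations(sorted(neighbors), 2):
        kept.add((a, b))
    return kept


# Invented example: three places connected to a region, plus one other link.
sample_edges = [
    ("a", "region"),
    ("b", "region"),
    ("c", "region"),
    ("c", "d"),
]
projected = collapse_hub(sample_edges, "region")
```

Note that a hub with n neighbors produces n(n-1)/2 new edges, which is why the density of connections around a large region begins to look like a filled polygon.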

Figure 11: Single mode network representation of Pleiades data

Even though I am still figuring out a method to transfer the color of the links from Gephi to QGIS, this type of representation has tremendous potential. If we can class different connections and pull those from the data set, we can begin to represent political areas, land masses, and other groupings as the sum of their shared connections in geographic space. So, instead of drawing arbitrary polygons, it is the connections themselves that create the “area” of a place. If these connections are able to respect underlying geography (roads, mountain passes, navigable rivers, springs, and other features), I think we may have a very powerful way of representing economic regions, areas of social interaction, political control, etc., and of exploring how those different networks interact and influence each other in geographic space.


Ancient Itineraries: The Digital Lives of Art History

I am very happy to announce that I have been chosen as a participant for Ancient Itineraries: The Digital Lives of Art History institute, which is supported by the Getty Foundation as part of their Digital Art History initiative.


A (VERY!) brief synopsis: The institute will focus on three areas of concern to digital art history: provenance, geographies, and visualization. We will create detailed specifications, assess different methodologies, and create a detailed proof of concept for each of these three areas. The results of this work will be translatable to different project plans and research opportunities at the close of the institute.


Given the detailed description of the institute (linked above) and the various specialties and strengths of the organizers, I think this will be a fascinating exploration of the intersection of art history, ancient history, linked data, geospatial research, material culture, and digital humanities. I expect that this institute will not only create outstanding scholarly output, but will serve as the core of a new, robust community of scholars interested in linked data, material culture, and art history.

SNA, Wikipedia, and the Hellenistic World

Part of my work on the Big Ancient Mediterranean project involves creating a general software framework that can display social networks produced with Gephi, either as “stand alone” displays or integrated with geographic and textual information.

I created this particular module, “Hellenistic” Royal Relationships, to highlight the “stand alone” social network analysis (SNA) capabilities of BAM, and to serve as the start of a more generalized Hellenistic prosopography. Some other, more specialized work has been done in this direction; notably Trismegistos Networks and the efforts of SNAP:DRGN to create data standards for describing prosopographies and linking to other projects. Eventually this module will take advantage of these efforts, and provide stable URIs for its own data.

I envision this module serving several purposes. First, it provides an interesting visual representation of data contained within Wikipedia articles, including textual data that is not “linked” to other entries  and therefore not discoverable by automated means. It serves as a quick reference for familial relationships, and provides an entry point for further exploration and study. This project has created a “core” of relationships that can be further expanded by different projects. It also can function as a check on Wikipedia data; some of the relationships here are highly controversial, or could even be wrong.

For future development, the next steps are to add more data on the subjects, including birth / death / reigning dates and a time-line browser based on those dates. As mentioned above, more work needs to be done to take advantage of linked data projects, including linkages to Pleiades locations where appropriate, linkages to Nomisma IDs if the monarch minted coins, and the presentation of the underlying data in a format that is compatible with SNAP:DRGN. Finally, I would like to develop a method for the automatic discovery and extraction of relationships described in Wikipedia articles, which is an interesting, but difficult, problem.

Asia Minor in the Second Century CE: A New Wall Map From the Ancient World Mapping Center

The Ancient World Mapping Center at UNC Chapel Hill has just released a 1:750,000 scale map of Roman Asia Minor in the Second Century CE under CC BY 4.0. Several years in the making, this map is a collaboration between several different directors of the center (including myself), domain experts, and other historians, and it represents the current state of knowledge about Roman Asia Minor in this period.

Intended for class or research use, the map can be printed, distributed digitally, or remixed as desired. It is the same scale and general size as the AWMC’s other wall map offerings through Routledge, so if you are so inclined, you can add it to a “mega-map” of the Mediterranean World. Demand for the map was so high that Dropbox suspended our public folder; you can e-mail the AWMC (awmc@unc.edu) for a new direct download link.

Although this project is a static map of Asia Minor, the data behind the map can be found at the AWMC GitHub page. In a future post, I’ll write up how to use the AWMC geodata and the BAM framework to make an interactive version of this map which you can modify for your own needs.

#ReportHate, whywereafraid, and SNA

With a growing number of reports of election-related violence on twitter and other social media, I decided to perform a quick network analysis of #ReportHate and whywereafraid (which, as of this writing, has removed its twitter link from its site). I am interested in examining the development of these online communities, whether there are significant overlaps between them, and whether there are opportunities for increased cooperation.

The main component of the #ReportHate network. Dr. Singh’s community is in purple, the SPLC is in green, and the alt-right grouping is in red.

First, I looked at each network in isolation. I started with the network formed around #ReportHate, which consists of 2,781 nodes, 4,217 edges and 79 components. (A quick network primer: nodes are users or hashtags, while edges represent users mentioning a hashtag or another user. Components are parts of the graph where every node can trace a path through a number of edges to every other node in that part, and degree is the number of edges connecting a node to other nodes).
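To make the primer concrete, the sketch below computes degree and connected components for a tiny invented network using only the Python standard library. The account and hashtag names are placeholders, and degree here counts distinct neighbors, so repeated mentions between the same pair are counted once; Gephi may count them differently depending on how edges are merged on import.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """Return a list of node sets, one per connected component."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Invented edges: one three-node cluster and one isolated pair.
sample_edges = [
    ("@user_a", "#tag_one"),
    ("@user_b", "#tag_one"),
    ("@user_c", "@user_d"),
]
adjacency = build_adjacency(sample_edges)
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
components = connected_components(adjacency)
```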

Surprisingly to me, the SPLC (@SPLCENTER) is not the node with the highest degree; that honor belongs to Dr. Simran Jeet Singh (@SIKHPROF), a professor of religion at Trinity University, despite SPLC’s approximate 9-1 advantage in followers (96.3 thousand to 10.7 thousand). It will be interesting to see if this disparity closes as more individuals are aware of the hashtag.

The top ten nodes by degree are dominated by two very different philosophies. @SIKHPROF, @SPLCENTER, @SHAUNKING, @AMYWESTERVELT, @TRUMPSWORLD2016, and @THIERISTAN are certainly aligned with progressive causes and appear to be supporters of the SPLC’s efforts to accurately report hate crimes. However, the next major node on the graph, @STOPHATECRIMEZ, appears to be an alt-right account (including an emoticon frog as a stand-in for Pepe the Frog), which tweets links to accounts of violence against Trump voters (dominated by links to YouTube) and refutations of violence committed by Trump supporters. The accounts that retweeted this account likewise seem to be dominated by alt-right and far right wing individuals, and the hashtag #HATECRIME is almost exclusively used by this group.

Moving on from the alt-right component of the graph, it is apparent that there are several large clusters of SPLC supporters that as of yet do not have much interconnectivity. As this is a relatively new hashtag, I expect a growth of connections between clusters; if not, there is an opportunity for the “central” nodes of each cluster to reach out to each other and establish a more robust online community. Another potential issue is the nodes that are otherwise disconnected from the network; if these individuals are tweeting about incidents, it would be beneficial to reach out (virtually) and bring them into the larger #ReportHate network.

Unlike the #ReportHate network, with a strong connected component, the whywereafraid network is far more dispersed and much smaller. There are 992 nodes and 938 edges, with 151 components. The node with the highest degree count is Patrick Kingsley (@PATRICKKINGSLEY), a foreign correspondent with the Guardian paper; his high degree is the result of his tweet linking to the whywereafraid tumblr account.

The whywereafraid network

The other two of the top three nodes, @ADAMPOWERS and @JAMIETWORKOWSKI, seem to be allied with the progressive movement. The next node with the highest degree is the official account of Donald Trump (@REALDONALDTRUMP). However, this is due to other twitter users castigating him over election violence.

I then placed the networks together, to see if there was any overlap between the two growing communities. There are 26 users and 19 hashtags in common; when the entire network is placed in a graph, the node with the highest degree of the 26 is @SHAUNKING, who is mentioned four times by other users to bring his attention to whywereafraid. There are other tentative connections, but for the most part the two networks are very distinct, with little cross conversation.
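Finding the accounts and hashtags that two networks have in common is a set intersection over their node lists. A minimal sketch, assuming each network is available as a list of (source, target) pairs in which users start with `@` and hashtags with `#`; the sample data is invented.

```python
def node_set(edges):
    """Collect every node that appears in a list of (source, target) pairs."""
    return {node for edge in edges for node in edge}


def shared_nodes(edges_a, edges_b):
    """Return (users, hashtags) that appear in both networks, sorted."""
    common = node_set(edges_a) & node_set(edges_b)
    users = sorted(n for n in common if n.startswith("@"))
    hashtags = sorted(n for n in common if n.startswith("#"))
    return users, hashtags


# Hypothetical edge lists for two hashtag networks.
network_one = [("@user_a", "#tag_one"), ("@user_b", "@user_c")]
network_two = [("@user_c", "#tag_one"), ("@user_d", "#tag_two")]
users, hashtags = shared_nodes(network_one, network_two)
```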

The combined network. Edges that are from the #ReportHate data are in red, edges from the whywereafraid data are in blue.

This represents a danger and an opportunity for the supporters of #ReportHate and whywereafraid. As the #ReportHate and whywereafraid networks grow, they are likely to develop more links due to shared common interests, but there is the real possibility that many users will remain tied to their initial choice of hashtag, and not participate in the wider community or conversation. If nodes that are structurally important (a high betweenness centrality) in the #ReportHate graph, such as @SIKHPROF and @AMYWESTERVELT, could be brought into conversation with the major nodes of the whywereafraid graph, then there is a good chance to merge the two networks, increasing awareness, mutual support, and online presence.

#NoDAPL Twitter Analysis


Map By Carl Sack

The approximately 1,172 mile Dakota Access Pipeline1 has been highly controversial since its public unveiling in 2014.2 The Standing Rock Sioux and allied organizations took ultimately unsuccessful legal action to stop construction of the project3 while youth from the reservation began a social media campaign which gradually morphed into a larger movement with dozens of associated hashtags.4 I performed network analysis on #NODAPL, the most prominent of these hashtags on Twitter, between October 22 – 30, 2016. This revealed some interesting trends in the data, including the key role of alternative media, celebrities, and seemingly random twitter users holding the network together. Another surprising finding was the relatively minor role that Republican candidate Donald Trump’s twitter account plays in the #NODAPL conversation, especially compared to the accounts of Barack Obama, Hillary Clinton, Bernie Sanders, and Dr. Jill Stein.

My Visualization of the #NODAPL network

Preliminary Network Analysis:

Due to restrictions from the Twitter API and crashes / limitations from the software (see below), I do not have complete access to all Tweet traffic involving #NODAPL.5 I used the Twitter Archiving Google Sheet (TAGS) 6.16 to capture tweets that featured #NODAPL somewhere in the tweet text. The resulting sheets were then imported into a database, then exported into an edges table for use in Gephi. For technical details, see the “Detailed Procedure” section below.

Basic to any network analysis is the concept of nodes and edges. Nodes can represent people, places, things, ideas, etc – they are entities on the graph. In this case, nodes are twitter users and hashtags. Edges associate nodes in some manner; they can represent friendship, biological relationships, enmity, or anything else that links two nodes. For my analysis, edges are anytime a user includes a user name or hashtag in a tweet. For example, one of the most prominent users in this study, @RUTHHHOPKINS is represented as a node, with an edge created to the node #NODAPL every time she uses the hashtag in a tweet, like the example below:

#NODAPL itself was excluded as a node in this analysis, as every tweet and user would be directly connected to it. This network features 133,702 nodes linked by 630,393 edges.7 I used Gephi to identify communities of nodes that are strongly linked together, which are represented by different colors in the network visualization.8 In addition, I ran some basic network statistics, including measuring the degree of nodes (the number of edges between two individuals, hashtags, or individuals and hashtags) on the graph. In these measurements out-degree indicates that a node initiates a link to another node in the graph, which in this case means another user name or hashtag was mentioned in a tweet by the node in question. In-degree measures incoming edges, which indicates that a particular node is the subject of a twitter conversation.
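The edge-building rule described here (one edge from the tweeting user to every user name or hashtag in the tweet, with #NODAPL itself left out) can be sketched with a regular expression and two counters. This is an illustration under stated assumptions rather than the database queries actually used: the pattern `[@#]\w+` is a simplification that would also match things like e-mail fragments, names are upper-cased so that differently capitalized mentions collapse to one node, and the sample tweets are invented.

```python
import re
from collections import Counter

# Simplified pattern for @mentions and #hashtags (an assumption, not
# Twitter's own entity parsing).
TOKEN = re.compile(r"[@#]\w+")


def tweet_edges(author, text, excluded=("#NODAPL",)):
    """Return (source, target) edges for one tweet."""
    source = "@" + author.upper()
    edges = []
    for token in TOKEN.findall(text):
        target = token.upper()
        if target in excluded:
            continue
        edges.append((source, target))
    return edges


# Invented tweets from placeholder accounts.
sample_tweets = [
    ("user_one", "Standing with #StandingRock #NoDAPL @user_two"),
    ("user_two", "RT @user_one: #StandingRock #NoDAPL"),
]
all_edges = []
for author, text in sample_tweets:
    all_edges.extend(tweet_edges(author, text))

out_degree = Counter(source for source, target in all_edges)
in_degree = Counter(target for source, target in all_edges)
```

In this toy data the hashtag has the highest in-degree and no out-degree at all, which mirrors the pattern discussed in the analysis: hashtags and heavily mentioned accounts are "talked at" rather than talking.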

I first looked at the in-degree measurement. #STANDINGROCK was by far the node with the highest in-degree, indicating its popularity as a potential alternative hashtag to #NODAPL. @POTUS, the official twitter account of the President of the United States, was in second place, followed by #WATERISLIFE, @HILLARYCLINTON, @OFFICIALJADEN, @UR_NINJA, @SHAILENEWOODLEY, @MARKRUFFALO, and @RUTHHHOPKINS. In this list, only two nodes are not politicians, hashtags, or celebrities. @UR_Ninja is the official twitter account of Unicorn Riot9, a 501(c)3 nonprofit organization based in Minneapolis, Minnesota10 which has done extensive reporting on the Dakota Access Pipeline protests. @RUTHHHOPKINS is the twitter account of Ruth Hopkins, a Dakota/Lakota Sioux writer, journalist, and blogger. The high degree count on these nodes indicates that they may function as an information service, where their reporting on the situation is retweeted and mentioned by many other nodes in the network.

This measurement also revealed a marked difference between the in-degree and out-degree of nodes. The top 34 nodes by number of degrees are so dominated by in-degree connections that no node has an out-degree that contains more than 3.17% of its total edges. This reveals that such nodes are being “talked at”: they are mentioned in tweets, retweeted in large numbers, but by and large feature extremely limited further engagement with other Twitter users.
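The percentage quoted above is simply a node's out-degree divided by its total degree. A small sketch of that calculation, with invented numbers rather than values from the data set:

```python
def out_degree_share(in_degree, out_degree):
    """Return out-degree as a percentage of a node's total edges."""
    total = in_degree + out_degree
    return 100.0 * out_degree / total if total else 0.0


# Hypothetical node: mentioned 970 times, mentions others 30 times.
share = out_degree_share(in_degree=970, out_degree=30)
```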

A particular user group is indicative of this trend. Few politicians have used Twitter to actively engage with activists or to contribute to the dialogue surrounding the #NoDAPL movement. In some cases this is not surprising; the official twitter account of the President of the United States can scarcely be expected to contribute extensively to dialogue on twitter. Despite being the seventh highest degree node, and despite an occupation of her Brooklyn campaign headquarters on October 27, 2016,11 @HILLARYCLINTON, the official account of Hillary Clinton, has likewise not responded to #NoDAPL conversations on twitter. The official account of Bernie Sanders, @SENSANDERS, has also not extensively engaged with #NODAPL. However, on October 31, 2016, which is outside of the bounds of my data set, his account did issue a series of tweets in support of the #NODAPL movement.12

Another account of a politician, Dr. Jill Stein (@DRJILLSTEIN), is twelfth on in-degree, but only has five outwardly directed edges. Despite active involvement at the protests leading to charges of criminal trespass and criminal mischief,13 Dr. Stein’s twitter account has barely engaged with other users, with the only mentions in this data set originating from a retweet that mentioned Hillary Clinton and Barack Obama.14 Interestingly, despite over 1,000 retweets (many of which were collected by this study), her tweet mentioning both Hillary Clinton and Donald Trump15 was not captured by the TAGS software.

Perhaps surprisingly for a major party candidate, the twitter handle of Donald Trump, @REALDONALDTRUMP, is an outlier on this list: he ranks at 112,160 with only 933 total mentions. Trump’s publicized investments and connections with the Dakota Access project16 and environmental positions, including discounting climate change,17 almost certainly make him unlikely to be sympathetic to, let alone an ally of, the #NODAPL movement. Indeed, most of his mentions on the network are simply retweets of Dr. Jill Stein’s criticism of Donald Trump and Hillary Clinton’s lack of involvement in the pipeline issue.18

Drilling down further into the data, I next looked at the nodes with the highest out-degree, which represents nodes who mentioned other users and hashtags. There were some interesting variations from the trends of in-degree nodes. Three users, @DEANLEH, @CANATIVEOBT, and @WMN4SRVL had in-degree and out-degree measurements that were no more than 20% divergent from each other. However, this does not mean that these nodes are engaged in extensive online conversations. These accounts all feature extensive retweets and linkages to different causes often associated with the progressive movement, including climate change awareness, opposition to institutional racism, feminism, and anti-corporatism. All three of these accounts seem to perform a function similar to news aggregation, as the majority of their mentions are retweets from other sources and are not extensive discussions with other users.

Another useful statistic, betweenness, measures the number of shortest paths (connections between any two nodes on the graph that may involve any number of additional nodes) that pass through a specific node.19 Nodes with a high betweenness are “central” in that they play a critical role in connecting the network (and therefore in moving information through it). The single node with the highest betweenness is @UR_NINJA, which combined with its high degree ranking, suggests that the news service plays a critical role in bringing together individuals on the graph who are interested in social justice / progressive issues. Four other nodes in the top 25 betweenness list are likewise in the top 25 nodes by degree.
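For anyone who wants to see how betweenness is counted, the sketch below implements Brandes' algorithm for an unweighted, undirected graph in plain Python. It returns raw, unnormalized scores and is meant only to illustrate the idea on a toy graph; it makes no claim about the exact settings or normalization Gephi applies, and it would be slow on a network of this size.

```python
from collections import defaultdict, deque


def betweenness(edges):
    """Unnormalized betweenness for an unweighted, undirected graph."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search, counting shortest paths from `source`.
        order = []
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if distance[neighbor] < 0:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    paths[neighbor] += paths[node]
                    predecessors[neighbor].append(node)

        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]

    # Each unordered pair was counted from both endpoints.
    return {node: score / 2 for node, score in scores.items()}


# Toy path graph a - b - c - d: only the two middle nodes lie between others.
scores = betweenness([("a", "b"), ("b", "c"), ("c", "d")])
```

In the toy graph the end nodes score zero while each middle node sits on two shortest paths, which is the sense in which a modest-degree account can still hold a network together.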

The remaining nodes are somewhat surprising. The twitter profile for second highest betweenness node, @TNPMR has a limited online footprint outside of Twitter, and does not seem to be involved in a leadership capacity in a social movement or media organization. Another important node in this measurement, @AMAZONMILLER, only scores 1638th in total degrees, yet still retains an important place in the network structure. Looking further at this data, I next examined the Twitter profile of each user who scored in the top 25 for betweenness. I divided this list into people who seem to be primarily interested in progressive causes in general vs. those who expressed affinity for indigenous rights issues. The results were nearly evenly split, with a slight edge to the more general progressivists. However, only two of the top ten nodes in the betweenness category focused primarily on indigenous issues, while the rest were concerned with progressivist issues more broadly. What this may indicate is that, as a whole, indigenous activists may face future difficulties in promoting their narrative outside of the more general progressive interests of the online community.

Further Observations:

These preliminary steps have also revealed some issues about data collection and curation. Twitter’s REST and streaming APIs are woefully inadequate for examining the whole data set. While Twitter provides, in theory, a representative sample of the data set, one of the powers of social network analysis is the discovery of weak ties and other network structures which are by definition not representative of the network as a whole. This can be frustrating for academic study of the network, and extremely detrimental to movements that depend on social media to transmit their messages. Groups can look at their own twitter histories, but the larger network structure, along with crucial weak ties, may be invisible to them.

Although Twitter does provide mechanisms for obtaining the entire history of hashtag usage, tracking the organic development of other hashtags which are not heavily watched from the beginning is almost certainly a cost-prohibitive proposition for social movements that are loosely organized, under-funded, and / or have limited computer infrastructure. It would be a significant benefit for such groups to gain access to the Twitter history of their movements, and be able to follow the evolution of the conversation on social media. As hashtag use can grow organically, with many different signifiers used for conversations, Twitter’s current pricing structure and data access model puts these groups at a severe disadvantage and hinders the identification and cultivation of allied communities and supporters.

A less pressing, but nevertheless important, issue is access to Twitter’s archive by researchers. Unlike print material or traditional media, which may be tedious to analyze but are fully (and for the most part cheaply) accessible to interested parties, the complete set of tweets on a topic is impossible to study without significant funding. Even if a researcher could guess all of the hashtags that could emerge from a dynamic topic, the Twitter streaming API does not provide all relevant tweets. Such limitations make it challenging to use Twitter data in a pedagogical setting. Some of my students have expressed interest in conducting similar projects, but the need for constant downstream connections and the high cost of historical tweets have made all but the most superficial studies impossible. There needs to be a more cost-effective means of access for projects operating on a limited budget, students, and other academic users of Twitter’s data.

Next Steps:

In addition to the data set on #NoDAPL featured here, I have also compiled a number of hashtags and data in separate TAGS sheets which can be combined to see more of the network. I am currently running a python script to grab more tweet data from the streaming API, which should provide more tweets. After placing this data in the network and performing some basic sentiment analysis, I want to see if distinct communities have formed around different hashtags, and if those communities have noticeably different rhetorical strategies that correspond to the inclusion of certain hashtags. A long term goal is to secure funding to obtain the complete twitter archive of #NODAPL and related hashtags in order to perform a full social network and sentiment analysis. In addition, I would like to examine the twitter history of @UR_Ninja and other alternative news organizations to see if their followers form recognizable activist communities. As part of this analysis, I am especially interested to see how these communities change when news organizations shift their focus between causes (like #FERGUSON to #NODAPL), and to examine the interactions of these virtual communities with different social movements.

To overcome the issues I discovered with TAGS and TwitterStreamingImporter, I am currently running a python script (modeled after http://adilmoujahid.com/posts/2014/07/twitter-analytics/) that pulls in the full json object from Twitter’s streaming API for a number of hashtags related to #NODAPL. I think the best approach is to perform a weekly update of a “master” network that captures all of the data that I can dealing with #NODAPL, and then running statistics / etc from a filtered network in Gephi. I will be sure to post any additional developments here.

Detailed Procedure:

The first difficulty in analyzing Twitter traffic is actually obtaining Twitter data. While Twitter does retain a historical archive of all tweets, this resource is currently inaccessible for academic research unless licensing fees are paid to an archival service such as GNIP. There is an indication that GNIP is aware of the power of Twitter analytics for academic research, and there are different pricing plans available,20 but as my project is currently in the exploratory phase, I am operating without any funding. As such, I needed an alternative.

I first used TAGS to pull historical and incoming tweets into separate Google Sheets for each hashtag I was interested in. TAGS uses Twitter’s REST API, which limits search rates and results.21 I ran into rate limits rather quickly with my searches; in addition, my Google documents hit their size and row limits. TAGS does not provide the entire result from the Twitter API: place, retweeted (which indicates whether a tweet was retweeted), and other useful fields are left off. Finally, I noticed that the text of tweets was often truncated, which made searching for complete user names, hashtags, and full text problematic. Although TAGS is a convenient way to collect tweets, it cannot possibly hope to represent the full network.

Despite these limitations, TAGS can still provide some powerful insights with a little modification. After importing my TAGS documents into a PostgreSQL database, I mined the tweet text for all user mentions and hashtags from individual Twitter users, which formed the edges of my network. I then imported this into Gephi v.0.9.1,22 where I performed some basic network analysis and visualizations of the data.
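To give a sense of what mining the tweet text involves, here is a minimal Python sketch. It is not the PostgreSQL workflow described above: the regular expressions are simplified assumptions (Twitter's own tokenization rules are more involved), the sample tweets and the user name example_user are invented, and names are upper-cased only because Twitter treats them as case-insensitive.

```python
import re
from collections import Counter

# Simplified patterns, assumed for illustration only.
MENTION_RE = re.compile(r"@(\w{1,15})")
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_edges(author, text):
    """Return (source, target, kind) tuples for a single tweet."""
    source = "@" + author.upper()
    edges = []
    for name in MENTION_RE.findall(text):
        edges.append((source, "@" + name.upper(), "mention"))
    for tag in HASHTAG_RE.findall(text):
        edges.append((source, "#" + tag.upper(), "hashtag"))
    return edges


def build_edge_weights(tweets):
    """Count how often each edge occurs across (author, text) pairs."""
    counts = Counter()
    for author, text in tweets:
        counts.update(extract_edges(author, text))
    return counts


# Invented sample data, not taken from the collected tweets.
sample_tweets = [
    ("example_user", "Standing with @UR_Ninja today #NoDAPL"),
    ("example_user", "More from @ur_ninja #nodapl #WaterIsLife"),
]
edge_weights = build_edge_weights(sample_tweets)
```

Because truncated tweet text cuts off names and hashtags, a pattern-based approach like this one will miss or mangle some edges, which is one reason the full JSON object is preferable.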

After this analysis, I decided that I needed to capture more tweets as they are issued. I used the TwitterStreamingImporter plugin for Gephi,23 which uses Twitter’s streaming API.24 The result is not all tweets that contain the specified search terms, but a sample that numbers up to 1% of global tweets. At roughly 300–500 million tweets per day,25 the streaming API can return at most 3–5 million tweets per day on a given subject. For small data sets this may be sufficient, but it is impossible to tell how truly representative this sample is without the complete Twitter firehose.26
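The arithmetic behind that ceiling is simple enough to check in a few lines of Python. The 1% cap and the daily volumes are the figures cited above; the function name is my own.

```python
SAMPLE_CAP = 0.01  # the "up to 1%" ceiling cited above


def daily_cap(global_tweets_per_day, cap=SAMPLE_CAP):
    """Upper bound on tweets the stream could deliver in one day."""
    return round(global_tweets_per_day * cap)


low = daily_cap(300_000_000)
high = daily_cap(500_000_000)
print(f"{low:,} to {high:,} tweets per day at most")
```

This is only an upper bound on volume; it says nothing about how the sample is drawn or how representative it is.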

Unlike TAGS, TwitterStreamingImporter requires a constant internet connection to compile tweets. This is impractical, if not impossible, for individuals who use a single laptop or other machine between different locations. I also experienced some crashes while performing analytics and changing or running the visualization layouts; anyone wishing to style Twitter data using this technique may wish to save constantly and export different files for styling purposes. This plugin does a nice job of drawing edges between users, tweets, and hashtags, and specifies the type of edge (tweet, retweet, hashtag, etc.), although I would still like some more detailed information. The code is freely accessible,27 so I may be able to fork the repository and create a new plugin that pulls in all the data that I am interested in (especially geolocations, time of the tweet, etc.). However, I think simply using a Python script on a persistent connection will be my next step in this analysis.
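As a rough idea of what such a script could keep from each tweet once the full JSON object is available, consider the sketch below. The field names are the ones I recall from Twitter's tweet object and should be checked against the current documentation; the sample tweet is invented, and this is not the plugin's code.

```python
import json


def flatten_tweet(raw_line):
    """Pull the fields of interest out of one JSON-encoded tweet."""
    tweet = json.loads(raw_line)
    # These keys may be absent or null, so fall back to empty dicts.
    entities = tweet.get("entities") or {}
    coords = tweet.get("coordinates") or {}
    place = tweet.get("place") or {}
    user = tweet.get("user") or {}
    return {
        "id": tweet.get("id_str"),
        "user": user.get("screen_name"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "lon_lat": coords.get("coordinates"),
        "place": place.get("full_name"),
        "is_retweet": "retweeted_status" in tweet,
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"]
                     for m in entities.get("user_mentions", [])],
    }


# Invented sample in the assumed shape of a streamed tweet.
sample_line = json.dumps({
    "id_str": "1",
    "created_at": "Sun Nov 06 12:00:00 +0000 2016",
    "text": "Example text #NoDAPL",
    "user": {"screen_name": "example_user"},
    "coordinates": None,
    "place": None,
    "entities": {"hashtags": [{"text": "NoDAPL"}], "user_mentions": []},
})
record = flatten_tweet(sample_line)
```

Most tweets carry no coordinates or place, so any geographic analysis built on these fields will cover only a small part of the network.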


1  Dakota Access, LLC, and United States Army Corps of Engineers, “Environmental Assessment: Dakota Access Pipeline Project, Crossings of Flowage Easements and Federal Lands” (U.S. Army Corps of Engineers, Omaha District, 2016), 8, http://purl.fdlp.gov/GPO/gpo74064.

2  Further reading can be found at https://nycstandswithstandingrock.wordpress.com/standingrocksyllabus/, created by the NYC Stands with Standing Rock committee, a self-described “…group of Indigenous scholars and activists, and settler/ POC supporters” (https://nycstandswithstandingrock.wordpress.com/about/).

3  ABC News, “Court Denies Tribe’s Appeal to Block Dakota Access Pipeline,” ABC News, October 11, 2016, http://abcnews.go.com/US/court-denies-tribes-appeal-block-controversial-dakota-access/story?id=42700614.

4  “Rezpect Our Water,” accessed November 6, 2016, http://rezpectourwater.com/; “Thousands Nationwide Show Solidarity with the Standing Rock Sioux and #NoDAPL,” Sierra Club, September 13, 2016, http://www.sierraclub.org/planet/2016/09/thousands-nationwide-show-solidarity-standing-rock-sioux-and-nodapl.

5  N.b. Twitter is case-insensitive, but all user names and hashtags are capitalized here.

7  I used Gephi with the OpenOrd layout to create the network visualization after modifying the TAGS data in a PostgreSQL database. Although the OpenOrd layout is intended for undirected graphs (see https://marketplace.gephi.org/plugin/openord-layout/), its ability to handle large datasets with limited computing resources made it an attractive choice for this investigation.

8  The modularity for the graph is 0.414, with 862 communities detected. Thirty-two of these communities had 100 or more nodes; together they account for 131,080 of the 133,702 nodes in the graph (98.04%).

10  “About,” Unicorn Riot, accessed October 30, 2016, http://www.unicornriot.ninja/?page_id=372.

11  The Root Staff, “#NoDAPL: Indigenous Youths Occupy Hillary Clinton’s Brooklyn, NY, Headquarters,” The Root, October 29, 2016, http://www.theroot.com/articles/news/2016/10/nodapl-indigenous-youth-occupy-hillary-clintons-brooklyn-headquarters/; “Indigenous Youth Occupy Hillary Clinton Campaign Headquarters to Demand She Take Stand on #DAPL,” Democracy Now!, accessed November 4, 2016, http://www.democracynow.org/2016/10/28/indigenous_youth_occupy_hillary_clinton_campaign.

16  Oliver Milman, “Dakota Access Pipeline Company and Donald Trump Have Close Financial Ties,” The Guardian, October 26, 2016, sec. US news, https://www.theguardian.com/us-news/2016/oct/26/donald-trump-dakota-access-pipeline-investment-energy-transfer-partners; “The Latest: Trump Holds Dakota Access Pipeline Company Stock,” US News & World Report, accessed November 4, 2016, http://www.usnews.com/news/us/articles/2016-10-26/the-latest-pipeline-protesters-think-their-removal-imminent.

17  “Did Trump Say Climate Change Was a Chinese Hoax?,” @politifact, accessed November 4, 2016, http://www.politifact.com/truth-o-meter/statements/2016/jun/03/hillary-clinton/yes-donald-trump-did-call-climate-change-chinese-h/.

19  I performed eigenvector centrality analysis on the data set, but the top-ranked nodes deviated little from a ranking by total degree.

25  Jim Edwards, “Leaked Twitter API Data Shows the Number of Tweets Is in Serious Decline,” Business Insider, accessed November 2, 2016, http://www.businessinsider.com/tweets-on-twitter-is-in-serious-decline-2016-2; “Twitter Usage Statistics – Internet Live Stats,” accessed November 2, 2016, http://www.internetlivestats.com/twitter-statistics/#sources.

26  Research on the representative accuracy of Twitter’s API has been mixed; see Fred Morstatter et al., “Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose,” arXiv Preprint arXiv:1306.5204, 2013; Fred Morstatter, Jürgen Pfeffer, and Huan Liu, “When Is It Biased?: Assessing the Representativeness of Twitter’s Streaming API,” in Proceedings of the 23rd International Conference on World Wide Web (New York, NY: ACM, 2014).


“A Pipeline Fight and America’s Dark Past.” The New Yorker, September 6, 2016. http://www.newyorker.com/news/daily-comment/a-pipeline-fight-and-americas-dark-past.

“About.” NYC Stands with Standing Rock, September 13, 2016. https://nycstandswithstandingrock.wordpress.com/about/.

“About.” Unicorn Riot. Accessed October 30, 2016. http://www.unicornriot.ninja/?page_id=372.

“Appeals Court Halts Dakota Access Pipeline Work Pending Hearing.” Indianz. Accessed November 6, 2016. http://www.indianz.com/News/2016/09/16/appeals-court-halts-dakota-access-pipeli.asp.

Baldacci, Marlena, Emanuella Grinberg, and Holly Yan. “Dakota Access Pipeline: Police Remove Protesters.” CNN. Accessed November 6, 2016. http://www.cnn.com/2016/10/27/us/dakota-access-pipeline-protests/index.html.

Dakota Access, LLC, and United States Army Corps of Engineers. “Environmental Assessment: Dakota Access Pipeline Project, Crossings of Flowage Easements and Federal Lands.” U.S. Army Corps of Engineers, Omaha District, 2016. http://purl.fdlp.gov/GPO/gpo74064.

“Dakota Access Pipeline.” Accessed November 6, 2016. http://www.daplpipelinefacts.com/.

“Dakota Access Pipeline: Overview.” Accessed November 6, 2016. http://www.daplpipelinefacts.com/about/overview.html.

“Did Trump Say Climate Change Was a Chinese Hoax?” @politifact. Accessed November 4, 2016. http://www.politifact.com/truth-o-meter/statements/2016/jun/03/hillary-clinton/yes-donald-trump-did-call-climate-change-chinese-h/.

Edwards, Jim. “Leaked Twitter API Data Shows the Number of Tweets Is in Serious Decline.” Business Insider. Accessed November 2, 2016. http://www.businessinsider.com/tweets-on-twitter-is-in-serious-decline-2016-2.

Healy, Jack. “From 280 Tribes, a Protest on the Plains.” The New York Times, September 11, 2016. http://www.nytimes.com/interactive/2016/09/12/us/12tribes.html.

“Indigenous Youth Occupy Hillary Clinton Campaign Headquarters to Demand She Take Stand on #DAPL.” Democracy Now! Accessed November 4, 2016. http://www.democracynow.org/2016/10/28/indigenous_youth_occupy_hillary_clinton_campaign.

“Judge Rules That Construction Can Proceed On Dakota Access Pipeline.” NPR.org. Accessed November 6, 2016. http://www.npr.org/sections/thetwo-way/2016/09/09/493280504/judge-rules-that-construction-can-proceed-on-dakota-access-pipeline.

“Life in the Native American Oil Protest Camps.” BBC News, September 2, 2016, sec. US & Canada. http://www.bbc.com/news/world-us-canada-37249617.

McCausland, Phil. “More Than 80 Dakota Pipeline Protesters Arrested, Some Pepper Sprayed.” NBC News, October 23, 2016. http://www.nbcnews.com/news/us-news/more-80-dakota-access-pipeline-protesters-arrested-some-pepper-sprayed-n671281.

McCleary, Mike. “As Standing Rock Protesters Face Down Armored Trucks, the World Watches on Facebook.” WIRED. Accessed October 30, 2016. https://www.wired.com/2016/10/standing-rock-protesters-face-police-world-watches-facebook/.

Milman, Oliver. “Dakota Access Pipeline Company and Donald Trump Have Close Financial Ties.” The Guardian, October 26, 2016, sec. US news. https://www.theguardian.com/us-news/2016/oct/26/donald-trump-dakota-access-pipeline-investment-energy-transfer-partners.

Morstatter, Fred, Jürgen Pfeffer, and Huan Liu. “When Is It Biased?: Assessing the Representativeness of Twitter’s Streaming API.” In Proceedings of the 23rd International Conference on World Wide Web. New York, NY: ACM, 2014.

Morstatter, Fred, Jürgen Pfeffer, Huan Liu, and Kathleen M Carley. “Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose.” arXiv Preprint arXiv:1306.5204, 2013.

ABC News. “Court Denies Tribe’s Appeal to Block Dakota Access Pipeline.” ABC News, October 11, 2016. http://abcnews.go.com/US/court-denies-tribes-appeal-block-controversial-dakota-access/story?id=42700614.

———. “Timeline of the Dakota Access Pipeline Protests.” ABC News, October 31, 2016. http://abcnews.go.com/US/timeline-dakota-access-pipeline-protests/story?id=43131355.

“Rezpect Our Water.” Accessed November 6, 2016. http://rezpectourwater.com/.

The Root Staff. “#NoDAPL: Indigenous Youths Occupy Hillary Clinton’s Brooklyn, NY, Headquarters.” The Root, October 29, 2016. http://www.theroot.com/articles/news/2016/10/nodapl-indigenous-youth-occupy-hillary-clintons-brooklyn-headquarters/.

“The Digital Transition: How the Presidential Transition Works in the Social Media Age.” Whitehouse.gov, October 31, 2016. https://www.whitehouse.gov/blog/2016/10/31/digital-transition-how-presidential-transition-works-social-media-age.

“The Latest: Trump Holds Dakota Access Pipeline Company Stock.” US News & World Report. Accessed November 4, 2016. http://www.usnews.com/news/us/articles/2016-10-26/the-latest-pipeline-protesters-think-their-removal-imminent.

“Thousands Nationwide Show Solidarity with the Standing Rock Sioux and #NoDAPL.” Sierra Club, September 13, 2016. http://www.sierraclub.org/planet/2016/09/thousands-nationwide-show-solidarity-standing-rock-sioux-and-nodapl.

“Twitter Usage Statistics – Internet Live Stats.” Accessed November 2, 2016. http://www.internetlivestats.com/twitter-statistics/#sources.

Williams, Weston. “Standing Rock Protests Escalate, as Tribe Calls for DOJ to Investigate.” Christian Science Monitor, October 24, 2016. http://www.csmonitor.com/USA/Justice/2016/1024/Standing-Rock-protests-escalate-as-tribe-calls-for-DOJ-to-investigate.