{"id":409,"date":"2025-02-08T22:49:00","date_gmt":"2025-02-08T21:49:00","guid":{"rendered":"https:\/\/bowfinger.de\/blog\/?p=409"},"modified":"2025-02-10T17:08:05","modified_gmt":"2025-02-10T16:08:05","slug":"massive-nextcloud-log-file-quickly-analysed-using-python","status":"publish","type":"post","link":"https:\/\/bowfinger.de\/blog\/2025\/02\/massive-nextcloud-log-file-quickly-analysed-using-python\/","title":{"rendered":"Massive Nextcloud log file quickly analysed using Python"},"content":{"rendered":"\n<p>I ran into a problem with quite a buggy Nextcloud instance on a host with limited quota. The Nextcloud log file would balloon at a crazy rate. So at one point, I snatched a 700 MB sample (yeah, that took maybe an hour or so) and wondered: what&#8217;s wrong?<\/p>\n\n\n\n<p>So, first things first: Nextcloud&#8217;s log files are JSON files. Which makes them excruciatingly difficult to read. Okay, better than binary, but still, not an eye pleaser. They wouldn&#8217;t be easy to <code>grep<\/code> either. So, Python to the rescue as it has the <code>json<\/code> module*.<\/p>\n\n\n\n<p>First, using <code>head<\/code> I looked at the first 10 lines only. Why? Because I had no idea how my little script would perform and I wanted to try it on a small sample first.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash\">head -n 10 nextcloud.log > nextcloud.log.10<\/code><\/pre>\n\n\n\n<p>Because these logs are littered with user and directory names and specifics of that particular Nextcloud instance (it&#8217;ll be NC from here on), I won&#8217;t share any of them here. Sorry. But if you have NC yourself, just grab your own log from the <code>\/data\/<\/code> directory of your NC instance.<\/p>\n\n\n\n<p>I found each line to contain one JSON object (enclosed in curly brackets). 
So, let&#8217;s read this line-by-line and feed it into Python&#8217;s JSON parser:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python\">import json\n\nwith open(\"nextcloud.log.10\", \"r\") as fh:\n    for line in fh:\n        data = json.loads(line)<\/code><\/pre>\n\n\n\n<p>At this point, you can already get an idea of how long it takes to process each line. If you&#8217;re using Jupyter Notebook, you can place the <code>with<\/code> statement into its own cell and simply use the <code>%%timeit<\/code> cell magic for a good first impression. On my machine it says<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">592 \u00b5s \u00b1 7.65 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)<\/pre>\n\n\n\n<p>which is okay: 592&nbsp;\u00b5s for 10 lines, so roughly 60&nbsp;\u00b5s per line.<\/p>\n\n\n\n<p>Next, I wanted to inspect a few lines and make reading easier: pretty print, or <code>pprint<\/code> as its module is called, to the rescue!<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python\">from pprint import pprint\n\npprint(data)<\/code><\/pre>\n\n\n\n<p>This pretty prints the last line. 
If you want to access all 10 lines, create, for instance, an empty list <code>data_lines<\/code> first and do <code>data_lines.append(data)<\/code> inside the <code>for<\/code> loop.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python\">{'reqId': '&lt;redacted>',\n 'level': 2,\n 'time': '2025-02-06&lt;redacted>',\n 'remoteAddr': '&lt;redacted>',\n 'user': '&lt;redacted>',\n 'app': 'no app in context',\n 'method': 'GET',\n 'url': '\/&lt;redacted>\/apps\/user_status\/api\/&lt;redacted>?format=json',\n 'message': 'Temporary directory \/www\/htdocs\/&lt;redacted>\/tmp\/ is not present or writable',\n 'userAgent': 'Mozilla\/5.0 (Linux) &lt;redacted> (Nextcloud, &lt;redacted>)',\n 'version': '&lt;redacted>',\n 'data': []}<\/code><\/pre>\n\n\n\n<p>Okay, there is a <code>message<\/code> which might be interesting, but I found another one:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python\">{'reqId': '',\n'level': 0,\n'time': '2025-02-06T',\n'remoteAddr': '',\n'user': '',\n'app': 'no app in context',\n'method': 'PROPFIND',\n'url': '\/\/',\n'message': 'Calling without parameters is deprecated and will throw soon.',\n'userAgent': 'Mozilla\/5.0 (Linux) (Nextcloud, 4)',\n'version': '',\n'exception': {'Exception': 'Exception',\n   'Message': 'No parameters in call to ',\n    \u2026<\/code><\/pre>\n\n\n\n<p>Now, this is much more interesting: It contains a key <code>exception<\/code> with a message and a long traceback below.<\/p>\n\n\n\n<p>I simply want to know:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>How many of these exceptions are there?<\/li>\n\n\n\n<li>How many unique messages are there?<\/li>\n<\/ul>\n\n\n\n<p>In other words: Is this a clusterfuck, or can I silence this thing by fixing a handful of things?<\/p>\n\n\n\n<p>So, the idea is simple:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Read each line.<\/li>\n\n\n\n<li>Check if the line contains an <code>exception<\/code> 
key.<\/li>\n\n\n\n<li>In that case, count it and&#8230;<\/li>\n\n\n\n<li>&#8230; append the corresponding message to a <code>list<\/code>.<\/li>\n\n\n\n<li>Finally, convert that <code>list<\/code> into a <code>set<\/code>.<\/li>\n<\/ol>\n\n\n\n<p>And here is how this looks in Python:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python\">import json\nfrom pprint import pprint\n\nlines = 0\nexceptions = 0\nex_messages = []\n\nwith open(\"nextcloud.log\", \"r\") as fh:\n    for line in fh:\n        lines += 1\n        # Each line holds exactly one JSON object\n        data = json.loads(line)\n\n        # Only some entries carry an \"exception\" key\n        if \"exception\" in data:\n            exceptions += 1\n            msg = data[\"exception\"][\"Message\"]\n            ex_messages.append(msg)\n\nprint(f\"{lines:d} read, {exceptions:d} exceptions.\")\n\n# A set keeps each distinct message only once\ns_ex_msg = set(ex_messages)\nprint(f\"{len(s_ex_msg):d} unique message types.\")\n\npprint(s_ex_msg)<\/code><\/pre>\n\n\n\n<p>I got<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">37460 read, 32537 exceptions.<br>22 unique message types.<\/pre>\n\n\n\n<p>That&#8217;s a lot of exceptions but a surprisingly small number of unique messages, i.e. possible individual causes.<\/p>\n\n\n\n<p>In my case, it mainly showed me what I knew beforehand: The database was a total mess.<\/p>\n\n\n\n<p>But see what you find.<\/p>\n\n\n\n<p><strong><em>Exercise<\/em><\/strong>: Work out how to modify the script to count <em>how many<\/em> of the 32537 exceptions correspond to each of the 22 unique messages. And toot about it.<\/p>\n\n\n\n<p>*) I wonder if people will come and propose to use <code>simplejson<\/code>, as I&#8217;ve read in the wild, because &#8220;it&#8217;s faster!!!&#8221;. Use <code>%%timeit<\/code> to find out. Anything else is Mumpitz (forum voodoo).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I ran into a problem with quite a buggy Nextcloud instance on a host with limited quota. The Nextcloud log file would balloon at a crazy rate. 
So at one point, I snatched a 700 MB sample (yeah, that took maybe an hour or so) and wondered: what&#8217;s wrong? So, first things first: Nextcloud&#8217;s log&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[16,35],"tags":[18,99,12],"class_list":["post-409","post","type-post","status-publish","format-standard","hentry","category-linux","category-python","tag-linux","tag-nextcloud","tag-python"],"_links":{"self":[{"href":"https:\/\/bowfinger.de\/blog\/wp-json\/wp\/v2\/posts\/409","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bowfinger.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bowfinger.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bowfinger.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/bowfinger.de\/blog\/wp-json\/wp\/v2\/comments?post=409"}],"version-history":[{"count":8,"href":"https:\/\/bowfinger.de\/blog\/wp-json\/wp\/v2\/posts\/409\/revisions"}],"predecessor-version":[{"id":417,"href":"https:\/\/bowfinger.de\/blog\/wp-json\/wp\/v2\/posts\/409\/revisions\/417"}],"wp:attachment":[{"href":"https:\/\/bowfinger.de\/blog\/wp-json\/wp\/v2\/media?parent=409"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bowfinger.de\/blog\/wp-json\/wp\/v2\/categories?post=409"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bowfinger.de\/blog\/wp-json\/wp\/v2\/tags?post=409"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}